[SERVER-26305] Deadlock between replication stepdown and threads about to wait for read concern Created: 23/Sep/16  Updated: 25/Jan/17  Resolved: 13/Oct/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.3.14
Fix Version/s: 3.4.0-rc1

Type: Bug Priority: Critical - P2
Reporter: Kaloian Manassiev Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-26339 Stepdown waiters are signaled twice w... Closed
Related
is related to SERVER-26345 Make threads waiting for writeConcern... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-10-10, Sharding 2016-10-31
Participants:
Linked BF Score: 0

 Description   

Interrupting operations during replication stepdown acquires the Client lock first and then the replication coordinator mutex, as shown in this stack trace:

 [2016/09/22 23:14:59.994] (Inline Function) --------`-------- mongod!std::lock_guard<std::mutex>::{ctor}+0x10
 [2016/09/22 23:14:59.994] 00000000`0953d9b0 00000001`4052d8d3 mongod!mongo::repl::ReplicationCoordinatorImpl::interrupt+0x2b
 [2016/09/22 23:14:59.994] 00000000`0953da40 00000001`4052d6e7 mongod!mongo::ServiceContext::killOperation+0x1a3
 [2016/09/22 23:15:00.011] 00000000`0953dad0 00000001`404474ae mongod!mongo::ServiceContext::killAllUserOperations+0xb7
 [2016/09/22 23:15:00.011] 00000000`0953db40 00000001`404472a5 mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking+0x15e
 [2016/09/22 23:15:00.011] 00000000`0953dca0 00000001`404694bc mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown+0x85
 [2016/09/22 23:15:00.011] 00000000`0953dd20 00000001`3ff7cd76 mongod!mongo::repl::CmdReplSetStepDown::run+0x3dc
 [2016/09/22 23:15:00.011] 00000000`0953def0 00000001`3ff78892 mongod!mongo::Command::run+0x646
 [2016/09/22 23:15:00.011] 00000000`0953e450 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22

Threads that are about to wait for read or write concern first hold the replication coordinator mutex and then acquire the Client lock, as shown in this stack trace:

 [2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!std::lock_guard<mongo::Client>::{ctor}+0x13
 [2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil::__l2::<lambda_cff863bf3f2e525e5f5bbda6979e07b5>::operator()+0x13
 [2016/09/22 23:14:59.895] 00000000`0eb4c2f0 00000001`4018eaf2 mongod!std::condition_variable::wait<<lambda_cff863bf3f2e525e5f5bbda6979e07b5> >+0x33
 [2016/09/22 23:14:59.895] 00000000`0eb4c320 00000001`4018e864 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil+0x1e2
 [2016/09/22 23:14:59.895] 00000000`0eb4c3c0 00000001`404490c6 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssert+0x54
 [2016/09/22 23:14:59.895] 00000000`0eb4c450 00000001`4077a825 mongod!mongo::repl::ReplicationCoordinatorImpl::waitUntilOpTimeForRead+0x426
 [2016/09/22 23:14:59.895] 00000000`0eb4c710 00000001`4076ff92 mongod!mongo::ShardLocal::_exhaustiveFindOnConfig+0x135
 [2016/09/22 23:14:59.910] 00000000`0eb4ca60 00000001`4071ba16 mongod!mongo::Shard::exhaustiveFindOnConfig+0x122
 [2016/09/22 23:14:59.910] 00000000`0eb4cbb0 00000001`40723a04 mongod!mongo::ShardingCatalogClientImpl::_exhaustiveFindOnConfig+0xd6
 [2016/09/22 23:14:59.910] 00000000`0eb4cd40 00000001`406f0955 mongod!mongo::ShardingCatalogClientImpl::getAllShards+0x134
 [2016/09/22 23:14:59.910] 00000000`0eb4d110 00000001`406d194d mongod!mongo::ClusterStatisticsImpl::getStats+0x75
 [2016/09/22 23:14:59.910] 00000000`0eb4d590 00000001`406c48bc mongod!mongo::BalancerChunkSelectionPolicyImpl::checkMoveAllowed+0x5d
 [2016/09/22 23:14:59.910] 00000000`0eb4d9b0 00000001`404da050 mongod!mongo::Balancer::moveSingleChunk+0x7c
 [2016/09/22 23:14:59.910] 00000000`0eb4db10 00000001`3ff7ce3a mongod!mongo::`anonymous namespace'::ConfigSvrMoveChunkCommand::run+0xc0
 [2016/09/22 23:14:59.910] 00000000`0eb4de30 00000001`3ff78892 mongod!mongo::Command::run+0x70a
 [2016/09/22 23:14:59.910] 00000000`0eb4e390 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22

This results in a deadlock when replication stepdown tries to interrupt those threads. The lock ordering was not a problem before this change, which made the replication threads use the generic OperationContext interruptible waits.
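
To make the inverted lock ordering concrete, the following is a minimal standalone sketch (not MongoDB code; clientLock and replCoordMutex are illustrative stand-ins for the Client lock and the replication coordinator mutex) in which each thread ends up holding the lock the other needs next:

#include <mutex>
#include <thread>

std::mutex clientLock;      // stand-in for the Client lock
std::mutex replCoordMutex;  // stand-in for the replication coordinator mutex

// Stepdown path: Client lock first, then the replication coordinator mutex
// (ServiceContext::killOperation -> ReplicationCoordinatorImpl::interrupt).
void stepdownPath() {
    std::lock_guard<std::mutex> client(clientLock);
    std::lock_guard<std::mutex> repl(replCoordMutex);  // blocks if a waiter holds it
    // ... interrupt the operation ...
}

// Read/write concern waiter: replication coordinator mutex first, then the
// Client lock (taken inside the interruptible wait).
void waiterPath() {
    std::lock_guard<std::mutex> repl(replCoordMutex);
    std::lock_guard<std::mutex> client(clientLock);  // blocks if stepdown holds it
    // ... wait on the condition variable ...
}

int main() {
    // Run these concurrently often enough and the process deadlocks:
    // each thread holds the lock the other needs to make progress.
    std::thread t1(stepdownPath);
    std::thread t2(waiterPath);
    t1.join();
    t2.join();
}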



 Comments   
Comment by Githook User [ 13/Oct/16 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-26305 Use interruptible condition variables in ReplicationCoordinatorImpl instead of KillOpListener

While this change also improves the readability of _awaitReplication_inlock and
stepDown, it resolves SERVER-26305 by breaking a deadlock cycle caused by the
fact that KillOpListener methods get run under a mutex in ServiceContext.
Branch: master
https://github.com/mongodb/mongo/commit/9228a2b401b4af0adfa53a61053fba3a7df4f75c
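
Below is a simplified, polling-based sketch of the approach the commit describes: the waiter itself checks for interruption instead of relying on a KillOpListener callback that runs under the ServiceContext mutex. FakeOperationContext and the helper names are illustrative stand-ins, not the actual MongoDB API, and the real implementation wakes the waiter immediately rather than polling:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Illustrative stand-in for an operation that can be marked killed.
struct FakeOperationContext {
    std::atomic<bool> killed{false};

    // Wait on cv until pred() is true, periodically rechecking the kill flag.
    // Killing the operation no longer requires calling back into the
    // replication coordinator, so no second mutex is taken while the
    // ServiceContext/Client lock is held.
    template <typename Pred>
    void waitForConditionOrInterrupt(std::condition_variable& cv,
                                     std::unique_lock<std::mutex>& lk,
                                     Pred pred) {
        while (!pred()) {
            if (killed.load())
                throw std::runtime_error("operation interrupted");
            cv.wait_for(lk, std::chrono::milliseconds(100));
        }
    }
};

// Hypothetical waiter: holds only the "replication coordinator" mutex while
// waiting; stepdown can mark the operation killed without taking that mutex.
void waitUntilOpTimeForRead(FakeOperationContext& opCtx,
                            std::condition_variable& cv,
                            std::mutex& replCoordMutex,
                            bool& opTimeReached) {
    std::unique_lock<std::mutex> lk(replCoordMutex);
    opCtx.waitForConditionOrInterrupt(cv, lk, [&] { return opTimeReached; });
}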

Comment by Githook User [ 06/Oct/16 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: Revert "SERVER-26305 Use interruptible condition variables in ReplicationCoordinatorImpl instead of KillOpListener"

This reverts commit 360d6ae39305dbc2f41ca7e0bfd424081fd4f030.
Branch: master
https://github.com/mongodb/mongo/commit/c421cc52e2112e7d0ef22e822dfdf0e5d6186b92

Comment by Githook User [ 05/Oct/16 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-26305 Use interruptible condition variables in ReplicationCoordinatorImpl instead of KillOpListener

While this change also improves the readability of _awaitReplication_inlock and
stepDown, it resolves SERVER-26305 by breaking a deadlock cycle caused by the
fact that KillOpListener methods get run under a mutex in ServiceContext.
Branch: master
https://github.com/mongodb/mongo/commit/360d6ae39305dbc2f41ca7e0bfd424081fd4f030

Comment by Andy Schwerin [ 26/Sep/16 ]

They aren't yet interruptible. I deferred that work, but could revive it if needed.

Comment by Kaloian Manassiev [ 26/Sep/16 ]

It turns out that this is not such an easy task, because the killOperation call also needs to invoke the listeners, but its contract is that it is called with the client mutex acquired.

However, now that I am looking at ReplicationCoordinatorImpl::interrupt, I am thinking that since the waits for read/write concern are now interruptible, we no longer need it and could just get rid of this method altogether.

Comment by Eric Milkie [ 26/Sep/16 ]

Sounds good, if we have the list at construction time.

Comment by Kaloian Manassiev [ 26/Sep/16 ]

How about we just initialize the list of killOp listeners at construction time? That way we do not need to expose a lever to disable registering new ones.

Comment by Eric Milkie [ 26/Sep/16 ]

If that's true, I'd like to see us remove the locking of the mutex entirely from setKillAllOperations(). Strangely, killOperation() already does not lock the mutex, even though it also iterates through the listener list.
Also, we should permit the ServiceContext to be set into a mode that prevents registerKillOpListener() from working, thus making it safe to observe the listener list without locking the mutex. We could set that mode after the last registerKillOpListener call at startup.
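
A minimal sketch of the "freeze the listener list after startup" idea from this comment; the ListenerRegistry class and seal() method are hypothetical names, not the real ServiceContext API:

#include <cassert>
#include <mutex>
#include <vector>

struct KillOpListener {
    virtual ~KillOpListener() = default;
    virtual void interrupt(unsigned opId) = 0;
};

class ListenerRegistry {
public:
    void registerKillOpListener(KillOpListener* listener) {
        std::lock_guard<std::mutex> lk(_mutex);
        assert(!_sealed);  // registration is only allowed during startup
        _listeners.push_back(listener);
    }

    // Called once, after the last registerKillOpListener at startup.
    void seal() {
        std::lock_guard<std::mutex> lk(_mutex);
        _sealed = true;
    }

    // Safe to call without holding the mutex once sealed: the vector can no
    // longer change, so readers may iterate it lock-free.
    void notifyAll(unsigned opId) const {
        for (auto* listener : _listeners)
            listener->interrupt(opId);
    }

private:
    mutable std::mutex _mutex;
    bool _sealed = false;
    std::vector<KillOpListener*> _listeners;
};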

Comment by Kaloian Manassiev [ 26/Sep/16 ]

But this vector should not change during the runtime of the process. It should be instantiated once at service startup and not modified after that. Is there any scenario that requires the killOp listeners to change?

Comment by Eric Milkie [ 26/Sep/16 ]

I don't see how you can notify the KillOpListeners outside of the lock, as they are just stored in a vector that is protected by the service context mutex.

Comment by Kaloian Manassiev [ 26/Sep/16 ]

I was looking at the different ways to fix this, and I think the most appropriate one is to ensure that the client lock is not held while notifying the killOp listeners. The operation contexts would still be killed under the clients list lock. The sequence would be:

1. Lock the service context's clients list
2. Produce a list of the opIds which need to be killed
3. Mark all OperationContexts as killed
4. Unlock the service context
5. Notify the KillOpListeners outside of the service context's lock

The contract for each KillOpListener would change so that it must be prepared for the operation id on which it is invoked to no longer exist by the time ::interrupt is called.

schwerin/milkie - do you see any problem with this?
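
A minimal sketch of the sequence proposed above, using simplified stand-in types; markKilled() and notifyKillOpListeners() are illustrative names rather than the actual server functions:

#include <mutex>
#include <vector>

struct Op {
    unsigned id;
    void markKilled() { /* set the kill flag on the OperationContext */ }
};

std::mutex clientsListMutex;  // stand-in for the clients list lock
std::vector<Op*> clients;     // operations attached to clients

void notifyKillOpListeners(unsigned opId) {
    // Invoke each registered KillOpListener for opId; no locks are held here,
    // and the operation may already be gone by the time this runs.
    (void)opId;
}

void killAllUserOperations() {
    std::vector<unsigned> killedOpIds;

    {
        // 1. Lock the service context's clients list.
        std::lock_guard<std::mutex> lk(clientsListMutex);
        for (Op* op : clients) {
            // 2. Record the opIds that need to be killed.
            killedOpIds.push_back(op->id);
            // 3. Mark the OperationContexts as killed (still under the lock).
            op->markKilled();
        }
    }  // 4. Unlock the service context.

    // 5. Notify the KillOpListeners outside of the service context's lock.
    for (unsigned opId : killedOpIds)
        notifyKillOpListeners(opId);
}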
