-
Type: Bug
-
Resolution: Done
-
Priority: Critical - P2
-
Affects Version/s: 3.3.14
-
Component/s: Replication
-
Fully Compatible
-
ALL
-
Sharding 2016-10-10, Sharding 2016-10-31
Interrupting operations at replication stepdown first acquires the Client lock and then the replication coordinator mutex, as shown in this thread:
[2016/09/22 23:14:59.994] (Inline Function) --------`-------- mongod!std::lock_guard<std::mutex>::{ctor}+0x10
[2016/09/22 23:14:59.994] 00000000`0953d9b0 00000001`4052d8d3 mongod!mongo::repl::ReplicationCoordinatorImpl::interrupt+0x2b
[2016/09/22 23:14:59.994] 00000000`0953da40 00000001`4052d6e7 mongod!mongo::ServiceContext::killOperation+0x1a3
[2016/09/22 23:15:00.011] 00000000`0953dad0 00000001`404474ae mongod!mongo::ServiceContext::killAllUserOperations+0xb7
[2016/09/22 23:15:00.011] 00000000`0953db40 00000001`404472a5 mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking+0x15e
[2016/09/22 23:15:00.011] 00000000`0953dca0 00000001`404694bc mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown+0x85
[2016/09/22 23:15:00.011] 00000000`0953dd20 00000001`3ff7cd76 mongod!mongo::repl::CmdReplSetStepDown::run+0x3dc
[2016/09/22 23:15:00.011] 00000000`0953def0 00000001`3ff78892 mongod!mongo::Command::run+0x646
[2016/09/22 23:15:00.011] 00000000`0953e450 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22
Threads that are about to wait for read or write concern first hold the replication coordinator mutex and then acquire the Client lock, as shown in this thread:
[2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!std::lock_guard<mongo::Client>::{ctor}+0x13
[2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil::__l2::<lambda_cff863bf3f2e525e5f5bbda6979e07b5>::operator()+0x13
[2016/09/22 23:14:59.895] 00000000`0eb4c2f0 00000001`4018eaf2 mongod!std::condition_variable::wait<<lambda_cff863bf3f2e525e5f5bbda6979e07b5> >+0x33
[2016/09/22 23:14:59.895] 00000000`0eb4c320 00000001`4018e864 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil+0x1e2
[2016/09/22 23:14:59.895] 00000000`0eb4c3c0 00000001`404490c6 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssert+0x54
[2016/09/22 23:14:59.895] 00000000`0eb4c450 00000001`4077a825 mongod!mongo::repl::ReplicationCoordinatorImpl::waitUntilOpTimeForRead+0x426
[2016/09/22 23:14:59.895] 00000000`0eb4c710 00000001`4076ff92 mongod!mongo::ShardLocal::_exhaustiveFindOnConfig+0x135
[2016/09/22 23:14:59.910] 00000000`0eb4ca60 00000001`4071ba16 mongod!mongo::Shard::exhaustiveFindOnConfig+0x122
[2016/09/22 23:14:59.910] 00000000`0eb4cbb0 00000001`40723a04 mongod!mongo::ShardingCatalogClientImpl::_exhaustiveFindOnConfig+0xd6
[2016/09/22 23:14:59.910] 00000000`0eb4cd40 00000001`406f0955 mongod!mongo::ShardingCatalogClientImpl::getAllShards+0x134
[2016/09/22 23:14:59.910] 00000000`0eb4d110 00000001`406d194d mongod!mongo::ClusterStatisticsImpl::getStats+0x75
[2016/09/22 23:14:59.910] 00000000`0eb4d590 00000001`406c48bc mongod!mongo::BalancerChunkSelectionPolicyImpl::checkMoveAllowed+0x5d
[2016/09/22 23:14:59.910] 00000000`0eb4d9b0 00000001`404da050 mongod!mongo::Balancer::moveSingleChunk+0x7c
[2016/09/22 23:14:59.910] 00000000`0eb4db10 00000001`3ff7ce3a mongod!mongo::`anonymous namespace'::ConfigSvrMoveChunkCommand::run+0xc0
[2016/09/22 23:14:59.910] 00000000`0eb4de30 00000001`3ff78892 mongod!mongo::Command::run+0x70a
[2016/09/22 23:14:59.910] 00000000`0eb4e390 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22
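This results in a deadlock when replication stepdown tries to interrupt threads: the stepdown thread holds the Client lock and waits for the replication coordinator mutex, while the waiting thread holds the replication coordinator mutex and waits for the Client lock. The ordering was not a problem before this change, which made replication threads use the generic OperationContext interruptible waits.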
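The lock-order inversion described above can be illustrated with a minimal standalone sketch. It is not the server code and not necessarily how this ticket was resolved: `clientLock`, `replCoordMutex`, and `runBothPaths` are hypothetical stand-ins, and `std::scoped_lock` (C++17) is just one generic way to take two mutexes without depending on the order in which they are named. If each thread instead used two nested `std::lock_guard`s in the orders shown in the stack traces, the two threads could block each other indefinitely.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks named in the ticket.
std::mutex clientLock;
std::mutex replCoordMutex;

// Runs the two conflicting paths concurrently and returns how many times
// the shared counter was incremented.
//   Path 1 (stepdown interrupting operations): Client lock, then coordinator mutex.
//   Path 2 (waiter for read/write concern):    coordinator mutex, then Client lock.
// std::scoped_lock acquires both mutexes with a deadlock-avoidance algorithm,
// so the opposite naming order in the two paths cannot deadlock.
int runBothPaths(int iterations) {
    assert(iterations >= 0);
    int interrupts = 0;  // only touched while both mutexes are held

    std::thread stepdown([&] {
        for (int i = 0; i < iterations; ++i) {
            std::scoped_lock lk(clientLock, replCoordMutex);
            ++interrupts;
        }
    });
    std::thread waiter([&] {
        for (int i = 0; i < iterations; ++i) {
            std::scoped_lock lk(replCoordMutex, clientLock);
            ++interrupts;
        }
    });

    stepdown.join();
    waiter.join();
    return interrupts;
}
```

Calling `runBothPaths(n)` from a program built with C++17 and thread support completes and returns `2 * n`. The alternative design is to pick one fixed acquisition order for the two locks and follow it on every code path.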
- is depended on by
-
SERVER-26339 Stepdown waiters are signaled twice when running stepdown command during shutdown
- Closed
- is related to
-
SERVER-26345 Make threads waiting for writeConcern use interruptible wait
- Closed