  1. Core Server
  2. SERVER-26305

Deadlock between replication stepdown and threads about to wait for read concern

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.4.0-rc1
    • Affects Version/s: 3.3.14
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding 2016-10-10, Sharding 2016-10-31
    • 0

      Interrupting operations during replication stepdown first acquires the Client lock and then the replication coordinator mutex, as shown in this thread:

       [2016/09/22 23:14:59.994] (Inline Function) --------`-------- mongod!std::lock_guard<std::mutex>::{ctor}+0x10
       [2016/09/22 23:14:59.994] 00000000`0953d9b0 00000001`4052d8d3 mongod!mongo::repl::ReplicationCoordinatorImpl::interrupt+0x2b
       [2016/09/22 23:14:59.994] 00000000`0953da40 00000001`4052d6e7 mongod!mongo::ServiceContext::killOperation+0x1a3
       [2016/09/22 23:15:00.011] 00000000`0953dad0 00000001`404474ae mongod!mongo::ServiceContext::killAllUserOperations+0xb7
       [2016/09/22 23:15:00.011] 00000000`0953db40 00000001`404472a5 mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking+0x15e
       [2016/09/22 23:15:00.011] 00000000`0953dca0 00000001`404694bc mongod!mongo::repl::ReplicationCoordinatorImpl::stepDown+0x85
       [2016/09/22 23:15:00.011] 00000000`0953dd20 00000001`3ff7cd76 mongod!mongo::repl::CmdReplSetStepDown::run+0x3dc
       [2016/09/22 23:15:00.011] 00000000`0953def0 00000001`3ff78892 mongod!mongo::Command::run+0x646
       [2016/09/22 23:15:00.011] 00000000`0953e450 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22
      

      Threads that are about to wait for read or write concern first hold the replication coordinator mutex and then acquire the Client lock, as shown in this thread:

       [2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!std::lock_guard<mongo::Client>::{ctor}+0x13
       [2016/09/22 23:14:59.895] (Inline Function) --------`-------- mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil::__l2::<lambda_cff863bf3f2e525e5f5bbda6979e07b5>::operator()+0x13
       [2016/09/22 23:14:59.895] 00000000`0eb4c2f0 00000001`4018eaf2 mongod!std::condition_variable::wait<<lambda_cff863bf3f2e525e5f5bbda6979e07b5> >+0x33
       [2016/09/22 23:14:59.895] 00000000`0eb4c320 00000001`4018e864 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil+0x1e2
       [2016/09/22 23:14:59.895] 00000000`0eb4c3c0 00000001`404490c6 mongod!mongo::OperationContext::waitForConditionOrInterruptNoAssert+0x54
       [2016/09/22 23:14:59.895] 00000000`0eb4c450 00000001`4077a825 mongod!mongo::repl::ReplicationCoordinatorImpl::waitUntilOpTimeForRead+0x426
       [2016/09/22 23:14:59.895] 00000000`0eb4c710 00000001`4076ff92 mongod!mongo::ShardLocal::_exhaustiveFindOnConfig+0x135
       [2016/09/22 23:14:59.910] 00000000`0eb4ca60 00000001`4071ba16 mongod!mongo::Shard::exhaustiveFindOnConfig+0x122
       [2016/09/22 23:14:59.910] 00000000`0eb4cbb0 00000001`40723a04 mongod!mongo::ShardingCatalogClientImpl::_exhaustiveFindOnConfig+0xd6
       [2016/09/22 23:14:59.910] 00000000`0eb4cd40 00000001`406f0955 mongod!mongo::ShardingCatalogClientImpl::getAllShards+0x134
       [2016/09/22 23:14:59.910] 00000000`0eb4d110 00000001`406d194d mongod!mongo::ClusterStatisticsImpl::getStats+0x75
       [2016/09/22 23:14:59.910] 00000000`0eb4d590 00000001`406c48bc mongod!mongo::BalancerChunkSelectionPolicyImpl::checkMoveAllowed+0x5d
       [2016/09/22 23:14:59.910] 00000000`0eb4d9b0 00000001`404da050 mongod!mongo::Balancer::moveSingleChunk+0x7c
       [2016/09/22 23:14:59.910] 00000000`0eb4db10 00000001`3ff7ce3a mongod!mongo::`anonymous namespace'::ConfigSvrMoveChunkCommand::run+0xc0
       [2016/09/22 23:14:59.910] 00000000`0eb4de30 00000001`3ff78892 mongod!mongo::Command::run+0x70a
       [2016/09/22 23:14:59.910] 00000000`0eb4e390 00000001`404c1770 mongod!mongo::Command::execCommand+0xb22
      

      This results in a deadlock when replication stepdown tries to interrupt threads. The ordering was not a problem before this change, which made replication threads use the generic OperationContext interruptible waits.
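
      The inversion between the two traces can be replayed with a small lock-order tracker. This is only an illustrative sketch, not server code and not the fix that was applied: LockOrderTracker, recordAndCheck, ticketOrdersConflict, and the lock names "Client" and "ReplCoordMutex" are hypothetical stand-ins chosen for this example. The tracker records which lock was held when another was acquired and reports when a later acquisition reverses an order seen earlier.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical helper (not part of mongod): remembers "held -> acquired"
// pairs and flags an acquisition that reverses a previously recorded order.
class LockOrderTracker {
public:
    // Call when code that already holds `held` goes on to acquire `acquired`.
    // Returns true if the opposite order (acquired -> held) was seen before,
    // i.e. two threads following both orders could deadlock.
    bool recordAndCheck(const std::string& held, const std::string& acquired) {
        const bool inverted = _edges.count(std::make_pair(acquired, held)) > 0;
        _edges.emplace(held, acquired);
        return inverted;
    }

private:
    std::set<std::pair<std::string, std::string>> _edges;
};

// Replays the two acquisition orders described in this ticket.
bool ticketOrdersConflict() {
    LockOrderTracker tracker;

    // Stepdown path (first trace): the Client lock is held while
    // ReplicationCoordinatorImpl::interrupt takes the coordinator mutex.
    const bool first = tracker.recordAndCheck("Client", "ReplCoordMutex");
    assert(!first);  // nothing was recorded before, so no inversion yet

    // Read-concern wait path (second trace): the coordinator mutex is held
    // while the interruptible wait takes the Client lock.
    const bool second = tracker.recordAndCheck("ReplCoordMutex", "Client");

    return second;
}
```

      Calling ticketOrdersConflict() returns true, because the second path acquires the same two locks in the reverse order of the first. A tracker like this only detects the pattern; removing the deadlock still requires making both paths agree on a single order or avoiding holding one lock while taking the other.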

            Assignee:
            schwerin@mongodb.com Andy Schwerin
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Votes:
            0
            Watchers:
            6
