Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-25734

Deadlock between the balancer and step-up/stepdown

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.3.14
    • Affects Version/s: 3.3.12
    • Component/s: Sharding
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Sharding 2016-08-29, Sharding 2016-09-19
    • 0

      Seen in my patch build

      Balancer is running, trying to take a distributed lock, which does a local write and thus tries to take a database lock. It can't take the database lock because another thread is trying to make the process stepdown (which holds a global lock), but stepdown is blocked trying to stop the balancer. Deadlock.

      Stepdown thread:

       [2016/08/22 16:01:27.163] Thread 2443 (Thread 0x7f0081552700 (LWP 12463)):
       [2016/08/22 16:01:27.163] #0  0x00007f00beb0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
       [2016/08/22 16:01:27.163] #1  0x00007f00beb095d8 in _L_lock_854 () from /lib64/libpthread.so.0
       [2016/08/22 16:01:27.163] #2  0x00007f00beb094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
       [2016/08/22 16:01:27.164] #3  0x00007f00c2951612 in mongo::MigrationManager::interruptAndDisableMigrations() ()
       [2016/08/22 16:01:27.164] #4  0x00007f00c292381c in mongo::Balancer::stopThread() ()
       [2016/08/22 16:01:27.164] #5  0x00007f00c25e5525 in mongo::repl::ReplicationCoordinatorExternalStateImpl::shardingOnStepDownHook() ()
       [2016/08/22 16:01:27.164] #6  0x00007f00c2605411 in mongo::repl::ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction(mongo::repl::ReplicationCoordinatorImpl::PostMemberStateUpdateAction) ()
       [2016/08/22 16:01:27.164] #7  0x00007f00c2606ebe in mongo::repl::ReplicationCoordinatorImpl::_stepDownContinue(mongo::executor::TaskExecutor::EventHandle, mongo::OperationContext*, mongo::Date_t, mongo::Date_t, bool, bool, mongo::Status*) ()
       [2016/08/22 16:01:27.164] #8  0x00007f00c2607374 in mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Status*) ()
       [2016/08/22 16:01:27.164] #9  0x00007f00c260767a in mongo::repl::ReplicationCoordinatorImpl::stepDown(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&) ()
       [2016/08/22 16:01:27.164] #10 0x00007f00c2638226 in mongo::repl::CmdReplSetStepDown::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
       [2016/08/22 16:01:27.165] #11 0x00007f00c1fea5a4 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.165] #12 0x00007f00c1febc11 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.165] #13 0x00007f00c269b64d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.165] #14 0x00007f00c220187b in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
       [2016/08/22 16:01:27.165] #15 0x00007f00c1e16b4e in mongo::ServiceEntryPointMongod::_sessionLoop(mongo::transport::Session*) ()
       [2016/08/22 16:01:27.165] #16 0x00007f00c2ad78ab in mongo::(anonymous namespace)::runFunc(void*) ()
       [2016/08/22 16:01:27.165] #17 0x00007f00beb07aa1 in start_thread () from /lib64/libpthread.so.0
       [2016/08/22 16:01:27.165] #18 0x00007f00be854aad in clone () from /lib64/libc.so.6
      

      Balancer thread:

       [2016/08/22 16:01:27.145] Thread 2455 (Thread 0x7f0083661700 (LWP 29868)):
       [2016/08/22 16:01:27.145] #0  0x00007f00beb0ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
       [2016/08/22 16:01:27.145] #1  0x00007f00c20ba298 in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
       [2016/08/22 16:01:27.146] #2  0x00007f00c20bf2a6 in mongo::LockerImpl<false>::lockComplete(mongo::ResourceId, mongo::LockMode, unsigned int, bool) ()
       [2016/08/22 16:01:27.146] #3  0x00007f00c20aed19 in mongo::Lock::GlobalLock::waitForLock(unsigned int) ()
       [2016/08/22 16:01:27.146] #4  0x00007f00c20af225 in mongo::Lock::GlobalLock::GlobalLock(mongo::Locker*, mongo::LockMode, unsigned int) ()
       [2016/08/22 16:01:27.146] #5  0x00007f00c20af2a7 in mongo::Lock::DBLock::DBLock(mongo::Locker*, mongo::StringData, mongo::LockMode) ()
       [2016/08/22 16:01:27.146] #6  0x00007f00c20d2057 in mongo::AutoGetOrCreateDb::AutoGetOrCreateDb(mongo::OperationContext*, mongo::StringData, mongo::LockMode) ()
       [2016/08/22 16:01:27.146] #7  0x00007f00c200bcfd in mongo::CmdFindAndModify::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
       [2016/08/22 16:01:27.146] #8  0x00007f00c1fea7a8 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.146] #9  0x00007f00c1febc11 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.146] #10 0x00007f00c269b64d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.147] #11 0x00007f00c21ff332 in mongo::(anonymous namespace)::receivedCommand ()
       [2016/08/22 16:01:27.147] #12 0x00007f00c2201306 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
       [2016/08/22 16:01:27.147] #13 0x00007f00c20d36bf in mongo::DBDirectClient::call(mongo::Message&, mongo::Message&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) ()
       [2016/08/22 16:01:27.147] #14 0x00007f00c1e835cc in mongo::DBClientWithCommands::runCommandWithMetadata(mongo::StringData, mongo::StringData, mongo::BSONObj const&, mongo::BSONObj const&) ()
       [2016/08/22 16:01:27.147] #15 0x00007f00c29e9a3b in mongo::ShardLocal::_runCommand(mongo::OperationContext*, mongo::ReadPreferenceSetting const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::Duration<std::ratio<1l, 1000l> >, mongo::BSONObj const&) ()
       [2016/08/22 16:01:27.147] #16 0x00007f00c29e0193 in mongo::Shard::runCommand(mongo::OperationContext*, mongo::ReadPreferenceSetting const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::Duration<std::ratio<1l, 1000l> >, mongo::Shard::RetryPolicy) ()
       [2016/08/22 16:01:27.147] #17 0x00007f00c296b7b2 in mongo::DistLockCatalogImpl::grabLock(mongo::OperationContext*, mongo::StringData, mongo::OID const&, mongo::StringData, mongo::StringData, mongo::Date_t, mongo::StringData) ()
       [2016/08/22 16:01:27.147] #18 0x00007f00c2971177 in mongo::ReplSetDistLockManager::lockWithSessionID(mongo::OperationContext*, mongo::StringData, mongo::StringData, mongo::OID, mongo::Duration<std::ratio<1l, 1000l> >, mongo::Duration<std::ratio<1l, 1000l> >) ()
       [2016/08/22 16:01:27.148] #19 0x00007f00c29531c0 in mongo::MigrationManager::_scheduleWithDistLock_inlock(mongo::OperationContext*, mongo::HostAndPort const&, mongo::MigrationManager::Migration) ()
       [2016/08/22 16:01:27.148] #20 0x00007f00c2955833 in mongo::MigrationManager::_schedule(mongo::OperationContext*, mongo::MigrateInfo const&, bool, unsigned long, mongo::MigrationSecondaryThrottleOptions const&, bool) ()
       [2016/08/22 16:01:27.148] #21 0x00007f00c2955ef9 in mongo::MigrationManager::executeManualMigration(mongo::OperationContext*, mongo::MigrateInfo const&, unsigned long, mongo::MigrationSecondaryThrottleOptions const&, bool) ()
       [2016/08/22 16:01:27.148] #22 0x00007f00c2925984 in mongo::Balancer::moveSingleChunk(mongo::OperationContext*, mongo::ChunkType const&, mongo::ShardId const&, unsigned long, mongo::MigrationSecondaryThrottleOptions const&, bool) ()
       [2016/08/22 16:01:27.148] #23 0x00007f00c26bd562 in mongo::(anonymous namespace)::ConfigSvrMoveChunkCommand::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
       [2016/08/22 16:01:27.148] #24 0x00007f00c1fea5a4 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.148] #25 0x00007f00c1febc11 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.148] #26 0x00007f00c269b64d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/08/22 16:01:27.148] #27 0x00007f00c220187b in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
       [2016/08/22 16:01:27.149] #28 0x00007f00c1e16b4e in mongo::ServiceEntryPointMongod::_sessionLoop(mongo::transport::Session*) ()
       [2016/08/22 16:01:27.149] #29 0x00007f00c2ad78ab in mongo::(anonymous namespace)::runFunc(void*) ()
       [2016/08/22 16:01:27.149] #30 0x00007f00beb07aa1 in start_thread () from /lib64/libpthread.so.0
       [2016/08/22 16:01:27.149] #31 0x00007f00be854aad in clone () from /lib64/libc.so.6
      

            Assignee:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Reporter:
            spencer@mongodb.com Spencer Brody (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: