[SERVER-25734] Deadlock between the balancer and step-up/stepdown Created: 22/Aug/16  Updated: 19/Nov/16  Resolved: 15/Sep/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.3.12
Fix Version/s: 3.3.14

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-08-29, Sharding 2016-09-19
Participants:
Linked BF Score: 0

 Description   

Seen in my patch build

Balancer is running, trying to take a distributed lock, which does a local write and thus tries to take a database lock. It can't take the database lock because another thread is trying to make the process stepdown (which holds a global lock), but stepdown is blocked trying to stop the balancer. Deadlock.

Stepdown thread:

 [2016/08/22 16:01:27.163] Thread 2443 (Thread 0x7f0081552700 (LWP 12463)):
 [2016/08/22 16:01:27.163] #0  0x00007f00beb0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
 [2016/08/22 16:01:27.163] #1  0x00007f00beb095d8 in _L_lock_854 () from /lib64/libpthread.so.0
 [2016/08/22 16:01:27.163] #2  0x00007f00beb094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
 [2016/08/22 16:01:27.164] #3  0x00007f00c2951612 in mongo::MigrationManager::interruptAndDisableMigrations() ()
 [2016/08/22 16:01:27.164] #4  0x00007f00c292381c in mongo::Balancer::stopThread() ()
 [2016/08/22 16:01:27.164] #5  0x00007f00c25e5525 in mongo::repl::ReplicationCoordinatorExternalStateImpl::shardingOnStepDownHook() ()
 [2016/08/22 16:01:27.164] #6  0x00007f00c2605411 in mongo::repl::ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction(mongo::repl::ReplicationCoordinatorImpl::PostMemberStateUpdateAction) ()
 [2016/08/22 16:01:27.164] #7  0x00007f00c2606ebe in mongo::repl::ReplicationCoordinatorImpl::_stepDownContinue(mongo::executor::TaskExecutor::EventHandle, mongo::OperationContext*, mongo::Date_t, mongo::Date_t, bool, bool, mongo::Status*) ()
 [2016/08/22 16:01:27.164] #8  0x00007f00c2607374 in mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Status*) ()
 [2016/08/22 16:01:27.164] #9  0x00007f00c260767a in mongo::repl::ReplicationCoordinatorImpl::stepDown(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&) ()
 [2016/08/22 16:01:27.164] #10 0x00007f00c2638226 in mongo::repl::CmdReplSetStepDown::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
 [2016/08/22 16:01:27.165] #11 0x00007f00c1fea5a4 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.165] #12 0x00007f00c1febc11 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.165] #13 0x00007f00c269b64d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.165] #14 0x00007f00c220187b in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
 [2016/08/22 16:01:27.165] #15 0x00007f00c1e16b4e in mongo::ServiceEntryPointMongod::_sessionLoop(mongo::transport::Session*) ()
 [2016/08/22 16:01:27.165] #16 0x00007f00c2ad78ab in mongo::(anonymous namespace)::runFunc(void*) ()
 [2016/08/22 16:01:27.165] #17 0x00007f00beb07aa1 in start_thread () from /lib64/libpthread.so.0
 [2016/08/22 16:01:27.165] #18 0x00007f00be854aad in clone () from /lib64/libc.so.6

Balancer thread:

 [2016/08/22 16:01:27.145] Thread 2455 (Thread 0x7f0083661700 (LWP 29868)):
 [2016/08/22 16:01:27.145] #0  0x00007f00beb0ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
 [2016/08/22 16:01:27.145] #1  0x00007f00c20ba298 in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
 [2016/08/22 16:01:27.146] #2  0x00007f00c20bf2a6 in mongo::LockerImpl<false>::lockComplete(mongo::ResourceId, mongo::LockMode, unsigned int, bool) ()
 [2016/08/22 16:01:27.146] #3  0x00007f00c20aed19 in mongo::Lock::GlobalLock::waitForLock(unsigned int) ()
 [2016/08/22 16:01:27.146] #4  0x00007f00c20af225 in mongo::Lock::GlobalLock::GlobalLock(mongo::Locker*, mongo::LockMode, unsigned int) ()
 [2016/08/22 16:01:27.146] #5  0x00007f00c20af2a7 in mongo::Lock::DBLock::DBLock(mongo::Locker*, mongo::StringData, mongo::LockMode) ()
 [2016/08/22 16:01:27.146] #6  0x00007f00c20d2057 in mongo::AutoGetOrCreateDb::AutoGetOrCreateDb(mongo::OperationContext*, mongo::StringData, mongo::LockMode) ()
 [2016/08/22 16:01:27.146] #7  0x00007f00c200bcfd in mongo::CmdFindAndModify::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
 [2016/08/22 16:01:27.146] #8  0x00007f00c1fea7a8 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.146] #9  0x00007f00c1febc11 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.146] #10 0x00007f00c269b64d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.147] #11 0x00007f00c21ff332 in mongo::(anonymous namespace)::receivedCommand ()
 [2016/08/22 16:01:27.147] #12 0x00007f00c2201306 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
 [2016/08/22 16:01:27.147] #13 0x00007f00c20d36bf in mongo::DBDirectClient::call(mongo::Message&, mongo::Message&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) ()
 [2016/08/22 16:01:27.147] #14 0x00007f00c1e835cc in mongo::DBClientWithCommands::runCommandWithMetadata(mongo::StringData, mongo::StringData, mongo::BSONObj const&, mongo::BSONObj const&) ()
 [2016/08/22 16:01:27.147] #15 0x00007f00c29e9a3b in mongo::ShardLocal::_runCommand(mongo::OperationContext*, mongo::ReadPreferenceSetting const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::Duration<std::ratio<1l, 1000l> >, mongo::BSONObj const&) ()
 [2016/08/22 16:01:27.147] #16 0x00007f00c29e0193 in mongo::Shard::runCommand(mongo::OperationContext*, mongo::ReadPreferenceSetting const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::Duration<std::ratio<1l, 1000l> >, mongo::Shard::RetryPolicy) ()
 [2016/08/22 16:01:27.147] #17 0x00007f00c296b7b2 in mongo::DistLockCatalogImpl::grabLock(mongo::OperationContext*, mongo::StringData, mongo::OID const&, mongo::StringData, mongo::StringData, mongo::Date_t, mongo::StringData) ()
 [2016/08/22 16:01:27.147] #18 0x00007f00c2971177 in mongo::ReplSetDistLockManager::lockWithSessionID(mongo::OperationContext*, mongo::StringData, mongo::StringData, mongo::OID, mongo::Duration<std::ratio<1l, 1000l> >, mongo::Duration<std::ratio<1l, 1000l> >) ()
 [2016/08/22 16:01:27.148] #19 0x00007f00c29531c0 in mongo::MigrationManager::_scheduleWithDistLock_inlock(mongo::OperationContext*, mongo::HostAndPort const&, mongo::MigrationManager::Migration) ()
 [2016/08/22 16:01:27.148] #20 0x00007f00c2955833 in mongo::MigrationManager::_schedule(mongo::OperationContext*, mongo::MigrateInfo const&, bool, unsigned long, mongo::MigrationSecondaryThrottleOptions const&, bool) ()
 [2016/08/22 16:01:27.148] #21 0x00007f00c2955ef9 in mongo::MigrationManager::executeManualMigration(mongo::OperationContext*, mongo::MigrateInfo const&, unsigned long, mongo::MigrationSecondaryThrottleOptions const&, bool) ()
 [2016/08/22 16:01:27.148] #22 0x00007f00c2925984 in mongo::Balancer::moveSingleChunk(mongo::OperationContext*, mongo::ChunkType const&, mongo::ShardId const&, unsigned long, mongo::MigrationSecondaryThrottleOptions const&, bool) ()
 [2016/08/22 16:01:27.148] #23 0x00007f00c26bd562 in mongo::(anonymous namespace)::ConfigSvrMoveChunkCommand::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
 [2016/08/22 16:01:27.148] #24 0x00007f00c1fea5a4 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.148] #25 0x00007f00c1febc11 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.148] #26 0x00007f00c269b64d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
 [2016/08/22 16:01:27.148] #27 0x00007f00c220187b in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
 [2016/08/22 16:01:27.149] #28 0x00007f00c1e16b4e in mongo::ServiceEntryPointMongod::_sessionLoop(mongo::transport::Session*) ()
 [2016/08/22 16:01:27.149] #29 0x00007f00c2ad78ab in mongo::(anonymous namespace)::runFunc(void*) ()
 [2016/08/22 16:01:27.149] #30 0x00007f00beb07aa1 in start_thread () from /lib64/libpthread.so.0
 [2016/08/22 16:01:27.149] #31 0x00007f00be854aad in clone () from /lib64/libc.so.6



 Comments   
Comment by Kaloian Manassiev [ 15/Sep/16 ]

The deadlock on step down is gone away as a side effect of this commit, which removes interrupting the balancer on step down: https://github.com/mongodb/mongo/commit/3596a4fb09b31c955d95fbe17db0492589c96f03

Comment by Githook User [ 10/Sep/16 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-25734 Add a drain mode completion hook which runs outside of the global X lock
Branch: master
https://github.com/mongodb/mongo/commit/8eea87b6b90499d526e085ba723ddf989b837994

Generated at Thu Feb 08 04:10:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.