Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-26003

Deadlock at replication stepdown involving the NoopWriter

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.3.14
    • Affects Version/s: 3.3.12
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Repl 2016-09-19
    • 0

      The following very intricate deadlock can be hit at replication stepdown:

      1. T1 executes ReplicationCoordinatorImpl::stepDown which schedules a thread (T2) to do the actual step down and then calls ServiceContext::killAllUserOperations, which holds the ServiceContext mutex while iterating the set of clients. For each operation, which is killed, ReplicationCoordinatorImpl::interrupt is invoked, which acquires the _topoMutex.
      2. T2 runs ReplicationCoordinatorImpl::_stepDownFinish, which acquires the replication coordinator's _topoMutex and calls _performPostMemberStateUpdateAction(kActionCloseAllConnections) while holding that mutex. The call to _performPostMemberStateUpdateAction eventually tries to stop the NoopWriter, which waits for that thread (T3) to join.
      3. T3 (PeriodicNoopRunner::run) is signaled to terminate and as part of its destructor tries to unregister its client from the ServiceContext. However, the service context lock is held by T1 above, which deadlocks.

      This is the list of threads, which are involved in the deadlock:

      T1:

       [2016/09/06 22:35:19.725] Thread 1049 (Thread 0x7fc46de6d700 (LWP 11173)):
       [2016/09/06 22:35:19.725] #0  0x00007fc4aac21334 in __lll_lock_wait () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.725] #1  0x00007fc4aac1c5d8 in _L_lock_854 () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.726] #2  0x00007fc4aac1c4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.726] #3  0x00007fc4ae70eff1 in mongo::repl::ReplicationCoordinatorImpl::interrupt(unsigned int) ()
       [2016/09/06 22:35:19.726] #4  0x00007fc4ae84e46c in mongo::ServiceContext::killOperation(mongo::OperationContext*, mongo::ErrorCodes::Error) ()
       [2016/09/06 22:35:19.726] #5  0x00007fc4ae84e558 in mongo::ServiceContext::killAllUserOperations(mongo::OperationContext const*, mongo::ErrorCodes::Error) ()
       [2016/09/06 22:35:19.726] #6  0x00007fc4ae71def1 in mongo::repl::ReplicationCoordinatorImpl::stepDown_nonBlocking(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Status*) ()
       [2016/09/06 22:35:19.727] #7  0x00007fc4ae71e27a in mongo::repl::ReplicationCoordinatorImpl::stepDown(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&) ()
       [2016/09/06 22:35:19.727] #8  0x00007fc4ae74f046 in mongo::repl::CmdReplSetStepDown::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&) ()
       [2016/09/06 22:35:19.727] #9  0x00007fc4ae104447 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/09/06 22:35:19.727] #10 0x00007fc4ae105663 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/09/06 22:35:19.727] #11 0x00007fc4ae7b0b4d in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
       [2016/09/06 22:35:19.728] #12 0x00007fc4ae31c889 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
       [2016/09/06 22:35:19.728] #13 0x00007fc4adf2a97d in mongo::ServiceEntryPointMongod::_sessionLoop(mongo::transport::Session*) ()
       [2016/09/06 22:35:19.728] #14 0x00007fc4adf2b2e0 in std::_Function_handler<void (mongo::transport::Session*), mongo::ServiceEntryPointMongod::startSession(mongo::transport::Session&&)::{lambda(mongo::transport::Session*)#1}>::_M_invoke(std::_Any_data const&, mongo::transport::Session*&&) ()
       [2016/09/06 22:35:19.728] #15 0x00007fc4aebf5b7b in mongo::(anonymous namespace)::runFunc(void*) ()
       [2016/09/06 22:35:19.728] #16 0x00007fc4aac1aaa1 in start_thread () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.728] #17 0x00007fc4aa967aad in clone () from /lib64/libc.so.6
      

      T2:

       [2016/09/06 22:35:19.640] Thread 1106 (Thread 0x7fc48f7c2700 (LWP 4482)):
       [2016/09/06 22:35:19.640] #0  0x00007fc4aac1b2fd in pthread_join () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.640] #1  0x00007fc4af6eff57 in std::thread::join() ()
       [2016/09/06 22:35:19.640] #2  0x00007fc4ae6adc9f in mongo::repl::NoopWriter::stopWritingPeriodicNoops() ()
       [2016/09/06 22:35:19.640] #3  0x00007fc4ae71bff2 in mongo::repl::ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction(mongo::repl::ReplicationCoordinatorImpl::PostMemberStateUpdateAction) ()
       [2016/09/06 22:35:19.640] #4  0x00007fc4ae730e1e in mongo::repl::ReplicationCoordinatorImpl::_stepDownFinish(mongo::executor::TaskExecutor::CallbackArgs const&, mongo::executor::TaskExecutor::EventHandle const&) ()
       [2016/09/06 22:35:19.641] #5  0x00007fc4ae73b480 in mongo::repl::ReplicationExecutor::_doOperation(mongo::OperationContext*, mongo::Status const&, mongo::executor::TaskExecutor::CallbackHandle const&, std::__cxx11::list<mongo::repl::ReplicationExecutor::WorkItem, std::allocator<mongo::repl::ReplicationExecutor::WorkItem> >*, std::mutex*) ()
       [2016/09/06 22:35:19.641] #6  0x00007fc4ae739cf9 in mongo::repl::(anonymous namespace)::callNoExcept(std::function<void ()> const&) ()
       [2016/09/06 22:35:19.641] #7  0x00007fc4ae741770 in std::_Function_handler<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&), mongo::repl::ReplicationExecutor::scheduleWorkWithGlobalExclusiveLock(std::function<void (mongo::executor::TaskExecutor::CallbackArgs const&)> const&)::{lambda(mongo::OperationContext*, mongo::Status const&)#1}>::_M_invoke(std::_Any_data const&, mongo::OperationContext*&&, mongo::Status const&) ()
       [2016/09/06 22:35:19.641] #8  0x00007fc4ae673726 in std::_Function_handler<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&), mongo::repl::DatabaseTask::makeGlobalExclusiveLockTask(std::function<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&)> const&)::{lambda(mongo::OperationContext*, mongo::Status const&)#1}>::_M_invoke(std::_Any_data const&, mongo::OperationContext*&&, mongo::Status const&) ()
       [2016/09/06 22:35:19.641] #9  0x00007fc4ae7903a7 in mongo::repl::(anonymous namespace)::runSingleTask(std::function<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&)> const&, mongo::OperationContext*, mongo::Status const&) [clone .constprop.70] ()
       [2016/09/06 22:35:19.642] #10 0x00007fc4ae79167f in mongo::repl::TaskRunner::_runTasks() ()
       [2016/09/06 22:35:19.642] #11 0x00007fc4aec06825 in mongo::ThreadPool::_doOneTask(std::unique_lock<std::mutex>*) ()
       [2016/09/06 22:35:19.642] #12 0x00007fc4aec07350 in mongo::ThreadPool::_consumeTasks() ()
       [2016/09/06 22:35:19.642] #13 0x00007fc4aec07ef9 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
       [2016/09/06 22:35:19.642] #14 0x00007fc4af6f0010 in execute_native_thread_routine ()
       [2016/09/06 22:35:19.642] #15 0x00007fc4aac1aaa1 in start_thread () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.643] #16 0x00007fc4aa967aad in clone () from /lib64/libc.so.6
      

      T3:

       [2016/09/06 22:35:19.713] Thread 1055 (Thread 0x7fc46ee74700 (LWP 5672)):
       [2016/09/06 22:35:19.713] #0  0x00007fc4aac21334 in __lll_lock_wait () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.713] #1  0x00007fc4aac1c5d8 in _L_lock_854 () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.713] #2  0x00007fc4aac1c4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.713] #3  0x00007fc4ae84dfb3 in mongo::ServiceContext::ClientDeleter::operator()(mongo::Client*) const ()
       [2016/09/06 22:35:19.713] #4  0x00007fc4ae0c7771 in boost::thread_specific_ptr<std::unique_ptr<mongo::Client, mongo::ServiceContext::ClientDeleter> >::delete_data::operator()(void*) ()
       [2016/09/06 22:35:19.713] #5  0x00007fc4aee345c7 in tls_destructor ()
       [2016/09/06 22:35:19.714] #6  0x00007fc4aac1ab39 in start_thread () from /lib64/libpthread.so.0
       [2016/09/06 22:35:19.714] #7  0x00007fc4aa967aad in clone () from /lib64/libc.so.6
      

            Assignee:
            siyuan.zhou@mongodb.com Siyuan Zhou
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: