Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-28545

Replication subsystem holds Global lock in MODE_X while waiting for member state to change to ROLLBACK causing server to hang

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.5.10
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • v3.4
    • Hide

      I have not attempted to reproduce this issue outside of running Jepsen's "set" test which was integrated into Evergreen as part of SERVER-28461.

      lein run test --test set --tarball "file:///root/mongo-binaries.tgz" --ssh-private-key ~/.ssh/id_rsa_lxc --clock-skew faketime --libfaketime-path /opt/mongodb/libfaketime.so.1 --key-time-limit 15 --protocol-version 1 --read-concern linearizable --storage-engine mmapv1 --time-limit 300
      

      I will update this section of the ticket if I'm able to reproduce the issue via simpler means.

      Show
      I have not attempted to reproduce this issue outside of running Jepsen's "set" test which was integrated into Evergreen as part of SERVER-28461 . lein run test --test set --tarball "file:///root/mongo-binaries.tgz" --ssh-private-key ~/.ssh/id_rsa_lxc --clock-skew faketime --libfaketime-path /opt/mongodb/libfaketime.so.1 --key-time-limit 15 --protocol-version 1 --read-concern linearizable --storage-engine mmapv1 --time-limit 300 I will update this section of the ticket if I'm able to reproduce the issue via simpler means.
    • Repl 2017-05-08, Repl 2017-05-29, Repl 2017-07-10
    • 15

      It is possible that while the "rsBackgroundSync" thread is changing the member state to ROLLBACK for a thread running work on the ReplicationExecutor to need to acquire a lock. This design of holding a LockManager lock while waiting on a condition variable outside of the lock hierarchy seems prone to deadlock. For example, in the GDB output below, thread #39 is holding the Global lock in MODE_X and waiting for its task to set the follower mode to MemberState::RS_ROLLBACK in the ReplicationExecutor. The ReplicationExecutor is currently processing a vote response in thread #13 which waiting for the storage engine to make it durable. The durability thread (#6) is waiting to acquire the MMAPv1 flush lock, which is implicitly held by thread #39 as part of acquiring the global lock.

      Thread 39 (Thread 0x7fc1e03f0700 (LWP 20506)):
      #0  0x00007fc27f0f5404 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
      #1  0x00007fc2826fba7c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
      #2  0x00007fc28170ea8b in mongo::repl::ReplicationExecutor::Event::waitUntilSignaled() ()
      #3  0x00007fc2816f0e7d in mongo::repl::ReplicationCoordinatorImpl::setFollowerMode(mongo::repl::MemberState const&) ()
      #4  0x00007fc281735ef8 in mongo::repl::rollback(mongo::OperationContext*, mongo::repl::OplogInterface const&, mongo::repl::RollbackSource const&, int, mongo::repl::ReplicationCoordinator*, mongo::repl::StorageInterface*, std::function<void (int)>) ()
      #5  0x00007fc2816037c2 in mongo::repl::BackgroundSync::_runRollback(mongo::OperationContext*, mongo::Status const&, mongo::HostAndPort const&, int, mongo::repl::StorageInterface*) ()
      #6  0x00007fc281605b0e in mongo::repl::BackgroundSync::_produce(mongo::OperationContext*) ()
      #7  0x00007fc28160661a in mongo::repl::BackgroundSync::_runProducer() ()
      #8  0x00007fc28160679a in mongo::repl::BackgroundSync::_run() ()
      #9  0x00007fc2826fe690 in execute_native_thread_routine ()
      #10 0x00007fc27f0f1184 in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
      #11 0x00007fc27ee1ebed in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
      ...
      Thread 13 (Thread 0x7fc1ed615700 (LWP 20473)):
      #0  0x00007fc27f0f5404 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
      #1  0x00007fc2826fba7c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
      #2  0x00007fc2818d2cab in mongo::CommitNotifier::awaitBeyondNow() ()
      #3  0x00007fc2818d6a40 in mongo::dur::(anonymous namespace)::DurableImpl::waitUntilDurable() ()
      #4  0x00007fc2816d57e0 in mongo::repl::ReplicationCoordinatorExternalStateImpl::storeLocalLastVoteDocument(mongo::OperationContext*, mongo::repl::LastVote const&) ()
      #5  0x00007fc2816ff04b in mongo::repl::ReplicationCoordinatorImpl::_writeLastVoteForMyElection(mongo::repl::LastVote, mongo::executor::TaskExecutor::CallbackArgs const&) ()
      #6  0x00007fc28170f840 in mongo::repl::ReplicationExecutor::_doOperation(mongo::OperationContext*, mongo::Status const&, mongo::executor::TaskExecutor::CallbackHandle const&, std::__cxx11::list<mongo::repl::ReplicationExecutor::WorkItem, std::allocator<mongo::repl::ReplicationExecutor::WorkItem> >*, std::mutex*) ()
      #7  0x00007fc28170e0ed in mongo::repl::(anonymous namespace)::callNoExcept(std::function<void ()> const&) ()
      #8  0x00007fc281715a30 in std::_Function_handler<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&), mongo::repl::ReplicationExecutor::scheduleDBWork(std::function<void (mongo::executor::TaskExecutor::CallbackArgs const&)> const&, mongo::NamespaceString const&, mongo::LockMode)::{lambda(mongo::OperationContext*, mongo::Status const&)#1}>::_M_invoke(std::_Any_data const&, mongo::OperationContext*&&, mongo::Status const&) ()
      #9  0x00007fc28175d349 in mongo::repl::(anonymous namespace)::runSingleTask(std::function<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&)> const&, mongo::OperationContext*, mongo::Status const&) [clone .constprop.72] ()
      #10 0x00007fc28175e46f in mongo::repl::TaskRunner::_runTasks() ()
      #11 0x00007fc281bf38ec in mongo::ThreadPool::_doOneTask(std::unique_lock<std::mutex>*) ()
      #12 0x00007fc281bf439c in mongo::ThreadPool::_consumeTasks() ()
      #13 0x00007fc281bf4d56 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
      #14 0x00007fc2826fe690 in execute_native_thread_routine ()
      #15 0x00007fc27f0f1184 in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
      #16 0x00007fc27ee1ebed in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
      ...
      Thread 6 (Thread 0x7fc27cd1c700 (LWP 20466)):
      #0  0x00007fc27f0f57be in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
      #1  0x00007fc281225fb8 in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
      #2  0x00007fc28122a6be in mongo::LockerImpl<true>::lockComplete(mongo::ResourceId, mongo::LockMode, unsigned int, bool) ()
      #3  0x00007fc2812261d6 in mongo::AutoAcquireFlushLockForMMAPV1Commit::AutoAcquireFlushLockForMMAPV1Commit(mongo::Locker*) ()
      #4  0x00007fc2818d7f1f in mongo::dur::durThread(mongo::ClockSource*, long) ()
      #5  0x00007fc2826fe690 in execute_native_thread_routine ()
      #6  0x00007fc27f0f1184 in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
      #7  0x00007fc27ee1ebed in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
      

      Thank you to benety.goh for helping me with the GDB output.

            Assignee:
            spencer@mongodb.com Spencer Brody (Inactive)
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: