Core Server / SERVER-52564

Deadlock between step down and MongoDOperationContextSession

    • Fully Compatible
    • ALL
    • v4.4, v4.2
    • Sharding 2020-11-30, Sharding 2020-12-14, Sharding 2020-12-28, Sharding 2021-01-11, Sharding 2021-01-25, Sharding 2021-02-08
    • 18

      There is a deadlock between the thread running the step down process and the session catalog migration producer. More concretely:
      1. The thread running invalidateSessionsForStepdown holds the RSTL (replication state transition lock) and is waiting on a condition variable to check out the session.
      2. The session catalog migration thread is blocked here, waiting to acquire the lock held by [1]. It will never get it, because this thread is also the one that should check out the session and notify [1].
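
      The cycle described in the two steps above can be sketched in plain standard C++. This is not MongoDB code: `rstl`, `sessionCheckedOut`, `sessionAvailable`, `Outcome` and `simulateDeadlock` are hypothetical stand-ins, and the sketch assumes the migration thread is the one currently holding the session. Timeouts are added only so the demonstration terminates; the real threads wait indefinitely.

      ```cpp
      #include <chrono>
      #include <condition_variable>
      #include <mutex>
      #include <thread>

      // Result of one simulated run: did either side obtain what it was waiting for?
      struct Outcome {
          bool stepdownGotSession;
          bool migrationGotRstl;
      };

      Outcome simulateDeadlock(std::chrono::milliseconds timeout) {
          std::timed_mutex rstl;                      // stand-in for the RSTL
          std::mutex sessionMutex;                    // protects sessionCheckedOut
          std::condition_variable sessionAvailable;   // signalled when the session is released
          bool sessionCheckedOut = true;              // assumption: migration already holds the session

          Outcome outcome{false, false};

          // [1] The step down side (this thread) takes the RSTL first.
          std::unique_lock<std::timed_mutex> rstlHeld(rstl);

          // [2] The migration side needs the RSTL before it can release the session.
          std::thread migration([&] {
              std::unique_lock<std::timed_mutex> attempt(rstl, timeout);  // try_lock_for
              outcome.migrationGotRstl = attempt.owns_lock();
              if (attempt.owns_lock()) {
                  {
                      std::lock_guard<std::mutex> guard(sessionMutex);
                      sessionCheckedOut = false;
                  }
                  sessionAvailable.notify_all();
              }
          });

          // [1] still holds the RSTL while waiting for the session to become available.
          {
              std::unique_lock<std::mutex> lk(sessionMutex);
              outcome.stepdownGotSession = sessionAvailable.wait_for(
                  lk, timeout, [&] { return !sessionCheckedOut; });
          }

          migration.join();  // the RSTL is released only after this function returns
          return outcome;
      }
      ```

      Calling `simulateDeadlock(std::chrono::milliseconds(50))` returns with both fields false: neither side makes progress, because each waits for a resource that only the other can release.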

      On version 4.4, the thread holding the RSTL might have a stack trace like the following:

      #0  0x00007f1e44d01c3d in poll () from /lib64/libc.so.6
      #1  0x000056130ba24f87 in mongo::transport::TransportLayerASIO::BatonASIO::run(mongo::ClockSource*) ()
      #2  0x000056130ba0623d in mongo::transport::TransportLayerASIO::BatonASIO::run_until(mongo::ClockSource*, mongo::Date_t) ()
      #3  0x000056130bef5821 in mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*) ()
      #4  0x000056130beeacd0 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t) ()
      #5  0x000056130bea0795 in _ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockINS_12latch_detail5LatchEEZNS_28CondVarLockGrantNotification4waitEPNS_16OperationContextENS_8DurationISt5ratioILl1ELl1000EEEEEUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_PNS_10AtomicWordIlEEENKUlSJ_NS0_9WakeSpeedEE1_clESJ_SO_ ()
      #6  0x000056130bea0daf in mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >) ()
      #7  0x000056130bea29c6 in mongo::LockerImpl::_lockComplete(mongo::OperationContext*, mongo::ResourceId, mongo::LockMode, mongo::Date_t) ()
      #8  0x000056130beab773 in mongo::repl::ReplicationStateTransitionLockGuard::waitForLockUntil(mongo::Date_t) ()
      #9  0x000056130a3269f7 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::AutoGetRstlForStepUpStepDown(mongo::repl::ReplicationCoordinatorImpl*, mongo::OperationContext*, mongo::repl::ReplicationCoordinator::OpsKillingStateTransitionEnum, mongo::Date_t) ()
      #10 0x000056130a34bee9 in mongo::repl::ReplicationCoordinatorImpl::_stepDownFinish(mongo::executor::TaskExecutor::CallbackArgs const&, mongo::executor::TaskExecutor::EventHandle const&) ()
      ...
      

      The other thread's stack trace may differ depending on the operation. However, there will be a chunk migration thread in the session migration step (most likely in the SessionCatalogMigrationDestination class).

            Assignee:
            randolph@mongodb.com Randolph Tan
            Reporter:
            sergi.mateo-bellido@mongodb.com Sergi Mateo Bellido