[SERVER-52564] Deadlock between step down and MongoDOperationContextSession Created: 02/Nov/20  Updated: 29/Oct/23  Resolved: 04/Feb/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0, 4.2.14, 4.4.6

Type: Bug
Priority: Major - P3
Reporter: Sergi Mateo Bellido
Assignee: Randolph Tan
Resolution: Fixed
Votes: 0
Labels: sharding-csrs-stepdown-also, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-57756 Race between concurrent stepdowns and... Closed
Related
related to SERVER-55007 Deadlock between step down and MongoD... Closed
related to SERVER-60161 Deadlock between config server stepdo... Closed
related to SERVER-57167 Prevent throwing on session creation ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2
Sprint: Sharding 2020-11-30, Sharding 2020-12-14, Sharding 2020-12-28, Sharding 2021-01-11, Sharding 2021-01-25, Sharding 2021-02-08
Linked BF Score: 18

 Description   

There is a deadlock between the thread running the step-down process and the session catalog migration producer. More concretely:
1. The thread running invalidateSessionsForStepdown holds the RSTL (replication state transition lock) and is waiting on a condition variable to check out a session.
2. The session catalog migration thread is blocked waiting to acquire the lock held by [1], but it will never get it, because this thread is also the one that should check out the session and notify [1].
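
The circular wait described above can be sketched in standalone C++. This is a minimal illustration, not MongoDB code: the names simulateStepdownDeadlock, DeadlockOutcome, rstl, and sessionCheckedOut are hypothetical, a std::timed_mutex stands in for the RSTL, and the sketch assumes that the migration thread is the one that currently owns the session and would only release it (and notify) after acquiring the lock. Timeouts are added solely so that the demonstration terminates instead of hanging.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical result type: records what each side managed to do.
struct DeadlockOutcome {
    bool stepdownSawSessionFree;  // step-down side was notified that the session is free
    bool migrationAcquiredLock;   // migration side acquired the stand-in RSTL
};

// The calling thread plays the step-down role; a spawned thread plays the
// session catalog migration role. All names here are illustrative assumptions.
DeadlockOutcome simulateStepdownDeadlock(std::chrono::milliseconds timeout) {
    assert(timeout.count() > 0);

    std::timed_mutex rstl;              // stand-in for the RSTL
    std::mutex sessionMutex;            // protects sessionCheckedOut
    std::condition_variable sessionCv;  // signalled when the session is released
    bool sessionCheckedOut = true;      // assumption: migration side owns the session
    DeadlockOutcome outcome{false, false};

    // [1] The step-down side takes the lock first.
    rstl.lock();

    // [2] The migration side needs the lock before it can release the session.
    std::thread migration([&] {
        if (rstl.try_lock_for(timeout)) {
            outcome.migrationAcquiredLock = true;
            rstl.unlock();
            {
                std::lock_guard<std::mutex> lk(sessionMutex);
                sessionCheckedOut = false;
            }
            sessionCv.notify_all();
        }
        // On timeout the session is never released, so [1] is never notified.
    });

    // [1] While still holding the lock, wait for the session to become free.
    {
        std::unique_lock<std::mutex> lk(sessionMutex);
        outcome.stepdownSawSessionFree =
            sessionCv.wait_for(lk, timeout, [&] { return !sessionCheckedOut; });
    }

    // The lock is held until the migration side has given up, mirroring the
    // real deadlock, where neither side ever makes progress.
    migration.join();
    rstl.unlock();
    return outcome;
}
```

Calling simulateStepdownDeadlock(std::chrono::milliseconds(100)) should return with both fields false: each side times out waiting for the other. Without the timeouts, both threads would block indefinitely, which is the hang this ticket reports.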

On version 4.4, the thread holding the RSTL might have a stack trace like the following:

#0  0x00007f1e44d01c3d in poll () from /lib64/libc.so.6
#1  0x000056130ba24f87 in mongo::transport::TransportLayerASIO::BatonASIO::run(mongo::ClockSource*) ()
#2  0x000056130ba0623d in mongo::transport::TransportLayerASIO::BatonASIO::run_until(mongo::ClockSource*, mongo::Date_t) ()
#3  0x000056130bef5821 in mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*) ()
#4  0x000056130beeacd0 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t) ()
#5  0x000056130bea0795 in _ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockINS_12latch_detail5LatchEEZNS_28CondVarLockGrantNotification4waitEPNS_16OperationContextENS_8DurationISt5ratioILl1ELl1000EEEEEUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_PNS_10AtomicWordIlEEENKUlSJ_NS0_9WakeSpeedEE1_clESJ_SO_ ()
#6  0x000056130bea0daf in mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >) ()
#7  0x000056130bea29c6 in mongo::LockerImpl::_lockComplete(mongo::OperationContext*, mongo::ResourceId, mongo::LockMode, mongo::Date_t) ()
#8  0x000056130beab773 in mongo::repl::ReplicationStateTransitionLockGuard::waitForLockUntil(mongo::Date_t) ()
#9  0x000056130a3269f7 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::AutoGetRstlForStepUpStepDown(mongo::repl::ReplicationCoordinatorImpl*, mongo::OperationContext*, mongo::repl::ReplicationCoordinator::OpsKillingStateTransitionEnum, mongo::Date_t) ()
#10 0x000056130a34bee9 in mongo::repl::ReplicationCoordinatorImpl::_stepDownFinish(mongo::executor::TaskExecutor::CallbackArgs const&, mongo::executor::TaskExecutor::EventHandle const&) ()
...

The other thread's stack trace may differ depending on the operation; however, there will be a chunk migration thread in the session migration step (most likely in the SessionCatalogMigrationDestination class).



 Comments   
Comment by Randolph Tan [ 17/May/21 ]

Branch: v4.2

https://github.com/mongodb/mongo/commit/c2295adab43675bfde8c9b2aa5795d9b7fccb6b0

Comment by Githook User [ 23/Apr/21 ]

Author: Randolph Tan (email: randolph@10gen.com, username: renctan)

Message: SERVER-52564 Deadlock between step down and MongoDOperationContextSession

(cherry picked from commit 6ee5a25cfc951f6e914dcc9f7d1a63d2e7aeaa67)
Branch: v4.4
https://github.com/mongodb/mongo/commit/77530ffba57e2affb7574da8d04b7e91a66aa248

Comment by Githook User [ 04/Feb/21 ]

Author: Randolph Tan (email: randolph@10gen.com, username: renctan)

Message: SERVER-52564 Deadlock between step down and MongoDOperationContextSession
Branch: master
https://github.com/mongodb/mongo/commit/6ee5a25cfc951f6e914dcc9f7d1a63d2e7aeaa67

Comment by Sergi Mateo Bellido [ 11/Jan/21 ]

marcos.grillo, renctan: I found another BF that failed because of this issue: BF-19805. The main difference is that the second thread, the one trying to acquire the RSTL, is trying to migrate some data.

Generated at Thu Feb 08 05:28:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.