- Type: Bug
- Resolution: Gone away
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- ALL
- Sprint: Sharding EMEA 2022-10-17, Sharding EMEA 2022-10-31
While investigating a failure in setfcv_aborts_reshard_collection.js, we observe a number of log lines indicating that the global lock is being held and never released:
[js_test:setfcv_aborts_reshard_collection] d20792| 2022-10-11T06:50:59.435+00:00 D2 RECOVERY 23986 [Checkpointer] "Performing stable checkpoint.","attr":{"stableTimestamp":{"$timestamp":{"t":1665470518,"i":1}},"oplogNeededForRollback":"LockTimeout: Unable to acquire IX lock on '{2305843009213693954: Global, 2}' within 100ms. opId: 21421, op: OldestActiveTxnTimestamp, connId: 0."}
The current holder of this lock is the OplogApplier-0 thread:
"lockAddr": "0x562b41c00d00", "resourceId": "{2305843009213693954: Global, 2}", "granted": [{ "lockRequest": "0x50", "lockRequestAddr": "0x562b47432000", "thread": "139974578878208", "mode": "X", "convertMode": "NONE", "enqueueAtFront": true, "compatibleFirst": true, "debugInfo": "", "clientInfo": { "desc": "OplogApplier-0", "opid": 79 } } ],
That thread is currently waiting on a condition variable while trying to create the ReadyRangeDeletionsProcessor:
[js_test:setfcv_aborts_reshard_collection] sh110650| Thread 54: "OplogApplier-0" (Thread 0x7f4e5f0d3700 (LWP 91655))
#0  0x00007f4e7ae5ddef in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000562b3ea5223d in poll (__timeout=-1, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  mongo::transport::TransportLayerASIO::BatonASIO::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/transport/baton_asio_linux.cpp:359
#3  mongo::transport::TransportLayerASIO::BatonASIO::_poll[abi:cxx11](std::unique_lock<mongo::latch_detail::Mutex>&, mongo::ClockSource*) (this=0x562b473cb010, lk=..., clkSource=<optimized out>) at src/mongo/transport/baton_asio_linux.cpp:366
#4  0x0000562b3ea53325 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x562b473cb010, clkSource=0x562b417611e0) at src/mongo/transport/baton_asio_linux.cpp:210
#5  0x0000562b3f118d84 in mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57
#6  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=warning: RTTI symbol not found for class 'mongo::transport::TransportLayerASIO::BatonASIO' ..., this=0x562b47c43900) at src/mongo/stdx/condition_variable.h:162
#7  mongo::Waitable::wait<mongo::BasicLockableAdapter> (lk=..., cv=..., clkSource=0x562b417611e0, waitable=0x562b473cb010) at src/mongo/util/waitable.h:61
#8  mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/db/operation_context.cpp:325
#9  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=<optimized out>, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:330
#10 0x0000562b3c8cc772 in mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2})::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#3}::operator() (this=0x562b4742cd80, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50
#11 0x0000562b3c8ce94a in mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2})::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#4}::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at src/mongo/db/s/range_deleter_service.h:117
#12 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2}) (pred=..., finalDeadline=..., m=..., cv=..., this=0x562b4742cd80) at src/mongo/util/interruptible.h:443
#13 mongo::Interruptible::waitForConditionOrInterrupt<std::unique_lock<mongo::latch_detail::Latch>, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor(mongo::OperationContext*)::{lambda()#2}) (pred=..., m=..., cv=..., this=0x562b4742cd80) at src/mongo/util/interruptible.h:458
#14 mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::ReadyRangeDeletionsProcessor (this=0x562b47c43900, opCtx=0x562b4742cd80) at src/mongo/db/s/range_deleter_service.h:116
Upon creation, the ReadyRangeDeletionsProcessor spawns a thread and then waits for that thread to construct an opCtx before proceeding. From the logs, we can see that the range-deleter thread has already proceeded past this point and is waiting for its queue to be filled:
[js_test:setfcv_aborts_reshard_collection] sh110650| Thread 63: "range-deleter" (Thread 0x7f4e5a8ca700 (LWP 91711))
#0  0x00007f4e7ae5ddef in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000562b3ea5223d in poll (__timeout=-1, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  mongo::transport::TransportLayerASIO::BatonASIO::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/transport/baton_asio_linux.cpp:359
#3  mongo::transport::TransportLayerASIO::BatonASIO::_poll[abi:cxx11](std::unique_lock<mongo::latch_detail::Mutex>&, mongo::ClockSource*) (this=0x562b47c80910, lk=..., clkSource=<optimized out>) at src/mongo/transport/baton_asio_linux.cpp:366
#4  0x0000562b3ea53325 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x562b47c80910, clkSource=0x562b417611e0) at src/mongo/transport/baton_asio_linux.cpp:210
#5  0x0000562b3f118d84 in mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57
#6  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=warning: RTTI symbol not found for class 'mongo::transport::TransportLayerASIO::BatonASIO' ..., this=0x562b47c43900) at src/mongo/stdx/condition_variable.h:162
#7  mongo::Waitable::wait<mongo::BasicLockableAdapter> (lk=..., cv=..., clkSource=0x562b417611e0, waitable=0x562b47c80910) at src/mongo/util/waitable.h:61
#8  mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/db/operation_context.cpp:325
#9  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=<optimized out>, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:330
#10 0x0000562b3c8c4f42 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator()(mongo::Date_t, mongo::Interruptible::WakeSpeed) const (this=0x562b47cdb240, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50
#11 0x0000562b3c8ca852 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at src/third_party/boost/boost/optional/detail/optional_aligned_storage.hpp:64
#12 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::_runRangeDeletions()::<lambda()> > (pred=..., finalDeadline=..., m=..., cv=..., this=0x562b47cdb240) at src/mongo/util/interruptible.h:443
#13 mongo::Interruptible::waitForConditionOrInterrupt<std::unique_lock<mongo::latch_detail::Latch>, mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::_runRangeDeletions()::<lambda()> > (pred=..., m=..., cv=..., this=0x562b47cdb240) at src/mongo/util/interruptible.h:458
#14 mongo::RangeDeleterService::ReadyRangeDeletionsProcessor::_runRangeDeletions (this=<optimized out>) at src/mongo/db/s/range_deleter_service.cpp:112
However, the mutex and condition variable used to wait for the spawned thread to create its opCtx are the same ones the spawned thread later uses to wait on its queue. In this case, the spawned thread notified the condition variable and then re-acquired the mutex before the spawning thread could wake up and acquire it, causing the hang.