[SERVER-47582] Stepdown in drain mode can trigger deadlock due to clearing OplogTruncateAfterPoint Created: 16/Apr/20  Updated: 29/Oct/23  Resolved: 06/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.4.0-rc5, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-46984 Stop async updates to the oplogTrunca... Closed
related to SERVER-46201 Implement a generic ReplicaSetStateAw... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Execution Team 2020-05-04, Execution Team 2020-05-18
Participants:

 Description   

I found this deadlock while testing a behavior in drain mode: the stepdown command hangs. Attached are the log and the wait-for graph.

  • OplogApplier-0 thread is waiting for all writers to finish while holding the PBWM lock.
  • Writers are waiting on the RSTL held by the stepdown thread.
  • Stepdown thread is waiting on the PBWM lock in order to clear OplogTruncateAfterPoint, while holding the RSTL in X mode.

Here's the stepdown thread's stacktrace.

Thread 66: "conn20" (Thread 0x7f1731b4d700 (LWP 29264))
#0  0x00007f176410d403 in poll () from /lib64/libc.so.6
#1  0x00007f1769a477e7 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x7f1773a25a90, clkSource=0x7f176d570a90) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/stl_vector.h:805
#2  0x00007f1769a44add in mongo::transport::TransportLayerASIO::BatonASIO::run_until (this=this@entry=0x7f1773a25a90, clkSource=0x7f176d570a90, deadline=...) at src/mongo/transport/baton_asio_linux.h:247
#3  0x00007f1769f4da21 in mongo::Waitable::wait_until<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57
#4  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait_until<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=..., this=0x7f17739e5178) at src/mongo/stdx/condition_variable.h:162
#5  mongo::Waitable::wait_until<mongo::BasicLockableAdapter> (timeout_time=..., lk=..., cv=..., clkSource=<optimized out>, waitable=0x7f1773a25a90) at src/mongo/util/waitable.h:92
#6  mongo::ClockSource::waitForConditionUntil (this=<optimized out>, cv=..., bla=..., deadline=..., deadline@entry=..., waitable=0x7f1773a25a90) at src/mongo/util/clock_source.cpp:48
#7  0x00007f1769f43600 in mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/util/lockable_adapter.h:50
#8  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=0x7f1773514380, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:309
#9  0x00007f1769ef8792 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator()(mongo::Date_t, mongo::Interruptible::WakeSpeed) const (this=0x7f1773514380, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50
#10 0x00007f1769ef8ccf in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at src/mongo/db/concurrency/lock_state.cpp:246
#11 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (waitTimer=0x0, pred=..., finalDeadline=..., m=..., cv=..., this=0x7f1773514380) at src/mongo/util/interruptible.h:448
#12 mongo::Interruptible::waitForConditionOrInterruptFor<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (waitTimer=0x0, pred=..., ms=..., m=..., cv=..., this=0x7f1773514380) at src/mongo/util/interruptible.h:480
#13 mongo::CondVarLockGrantNotification::wait (this=0x7f17739e5128, opCtx=0x7f1773514380, timeout=...) at src/mongo/db/concurrency/lock_state.cpp:245
#14 0x00007f1769efa9e6 in mongo::LockerImpl::_lockComplete (this=0x7f17739e5080, opCtx=0x7f1773514380, resId=..., mode=mongo::MODE_IS, deadline=...) at src/mongo/db/concurrency/lock_state.cpp:945
#15 0x00007f1769eeea4b in mongo::Lock::ResourceLock::lock (this=0x7f1731b491b0, opCtx=<optimized out>, mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/d_concurrency.cpp:317
#16 0x00007f1769eeebd3 in mongo::Lock::GlobalLock::GlobalLock (this=0x7f1731b491a0, opCtx=0x7f1773514380, lockMode=mongo::MODE_IX, deadline=..., behavior=<optimized out>) at src/mongo/db/concurrency/d_concurrency.cpp:152
#17 0x00007f1769eeecbb in mongo::Lock::DBLock::DBLock (this=0x7f1731b49188, opCtx=0x7f1773514380, db="local", mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/lock_manager_defs.h:106
#18 0x00007f1768e0785f in mongo::AutoGetDb::AutoGetDb (this=0x7f1731b49160, opCtx=0x7f1773514380, dbName="local", mode=mongo::MODE_IX, deadline=...) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/ext/new_allocator.h:86
#19 0x00007f1768e07cfe in mongo::AutoGetCollection::AutoGetCollection (this=0x7f1731b49160, opCtx=0x7f1773514380, nsOrUUID=..., modeColl=mongo::MODE_IX, viewMode=mongo::AutoGetCollection::kViewsForbidden, deadline=...) at src/mongo/base/string_data.h:64
#20 0x00007f1768314b4e in mongo::repl::StorageInterfaceImpl::<lambda()>::operator()(void) const (__closure=0x7f1731b493d0) at src/mongo/db/repl/storage_interface_impl.cpp:930
#21 0x00007f17683153cd in mongo::writeConflictRetry<mongo::repl::StorageInterfaceImpl::upsertById(mongo::OperationContext*, const mongo::NamespaceStringOrUUID&, const mongo::BSONElement&, const mongo::BSONObj&)::<lambda()> > (f=..., ns=..., opStr=..., opCtx=0x7f1773514380) at src/mongo/util/fail_point.h:235
#22 mongo::repl::StorageInterfaceImpl::upsertById (this=<optimized out>, opCtx=<optimized out>, nsOrUUID=..., idKey=..., update=...) at src/mongo/db/repl/storage_interface_impl.cpp:974
#23 0x00007f176846fc8b in mongo::repl::ReplicationConsistencyMarkersImpl::_upsertOplogTruncateAfterPointDocument (this=<optimized out>, opCtx=0x7f1773514380, updateSpec=owned BSONObj 49 bytes @ 0x7f1773604598) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/char_traits.h:285
#24 0x00007f176846ff17 in mongo::repl::ReplicationConsistencyMarkersImpl::setOplogTruncateAfterPoint (this=0x7f176db86b80, opCtx=0x7f1773514380, timestamp=...) at src/mongo/base/string_data.h:74
#25 0x00007f176832d304 in mongo::repl::ReplicationCoordinatorExternalStateImpl::_stopAsyncUpdatesOfAndClearOplogTruncateAfterPoint (this=0x7f176db98600) at src/mongo/bson/timestamp.h:76
#26 0x00007f176835d5c8 in mongo::repl::ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction (this=0x7f176dc0a800, action=action@entry=mongo::repl::ReplicationCoordinatorImpl::kActionSteppedDown) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/unique_ptr.h:342
#27 0x00007f176835f91a in mongo::repl::ReplicationCoordinatorImpl::<lambda()>::operator()(void) const (__closure=0x7f1731b499d0) at src/mongo/db/repl/replication_coordinator_impl.cpp:2582
#28 0x00007f17683711f0 in mongo::repl::ReplicationCoordinatorImpl::stepDown (this=0x7f176dc0a800, opCtx=<optimized out>, force=<optimized out>, waitTime=..., stepdownTime=...) at src/mongo/db/repl/replication_coordinator_impl.cpp:2672
#29 0x00007f1768325bc0 in mongo::repl::CmdReplSetStepDown::run (this=<optimized out>, opCtx=0x7f1773514380, cmdObj=..., result=...) at src/mongo/util/duration.h:241
#30 0x00007f1768fa904a in mongo::BasicCommand::runWithReplyBuilder (replyBuilder=0x7f1773603c30, cmdObj=owned BSONObj 242 bytes @ 0x7f1772eca91d, db="admin", opCtx=0x7f1773514380, this=0x7f176b347c40 <mongo::repl::cmdReplSetStepDown>) at src/mongo/db/commands.h:799
#31 mongo::BasicCommandWithReplyBuilderInterface::Invocation::run (this=0x7f17739e1a20, opCtx=0x7f1773514380, result=0x7f1773603c30) at src/mongo/db/commands.cpp:764
#32 0x00007f1768fa2f4f in mongo::CommandHelpers::runCommandInvocation (opCtx=0x7f1773514380, request=..., invocation=0x7f17739e1a20, response=0x7f1773603c30) at src/mongo/db/commands.cpp:184
#33 0x00007f176872be15 in mongo::(anonymous namespace)::runCommandImpl (opCtx=<optimized out>, invocation=<optimized out>, request=..., replyBuilder=0x7f1773603c30, startOperationTime=..., behaviors=..., extraFieldsBuilder=0x7f1731b4a3b0, sessionOptions=...) at src/mongo/db/service_entry_point_common.cpp:851
#34 0x00007f176872ee19 in mongo::(anonymous namespace)::execCommandDatabase (opCtx=<optimized out>, command=<optimized out>, request=..., replyBuilder=<optimized out>, behaviors=...) at src/mongo/db/service_entry_point_common.cpp:1177

 

It's unclear why this isn't caught by the existing stepdown-in-drain-mode tests, such as jstests/replsets/step_down_during_draining.js and its friends; probably because the burn-in tests in my patch build run many more times than they do on the master branch. My new test timed out 3 times in burn-in tests. The logic was introduced in SERVER-46984 on 3/20/20, and SERVER-46201 refactored the code after that.



 Comments   
Comment by Githook User [ 08/May/20 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-47582 clearing the oplogTruncateAfterPoint during stepdown must not take the PBWM

(cherry picked from commit da4a8e0e85627d6febd8bce3fd87f221e0ff97c6)
Branch: v4.4
https://github.com/mongodb/mongo/commit/9573f636b18c3032e1bf548ae04afe4c6ffa0162

Comment by Githook User [ 06/May/20 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-47582 clearing the oplogTruncateAfterPoint during stepdown must not take the PBWM
Branch: master
https://github.com/mongodb/mongo/commit/da4a8e0e85627d6febd8bce3fd87f221e0ff97c6

Comment by Suganthi Mani [ 01/May/20 ]

I also agree with judah.schvimer's point of view, as I am also not sure how safe it is for onStepDownHook to run under ShouldNotConflictWithSecondaryBatchApplicationBlock; that would require investigation. Also, since the execution team has plans to remove the PBWM lock, I don't think it's worth the additional effort.

Comment by Judah Schvimer [ 01/May/20 ]

I lean towards the more targeted solution. We'd like to remove the PBWM in SERVER-38341, so my inclination is to avoid extra work. I'm also not 100% confident that putting ShouldNotConflictWithSecondaryBatchApplicationBlock around all of onStepDownHook is safe, and I feel that deciding would take extra investigation.

Comment by Dianna Hohensee (Inactive) [ 01/May/20 ]

judah.schvimer suganthi.mani How would you like to proceed? If you would like a more global solution for stepdown (either never taking the PBWM, by placing a ShouldNotConflictWithSecondaryBatchApplicationBlock at a high level of the stepdown code, or preemptively taking the PBWM in an intent mode), then I need some guidance, or I can hand the ticket back.

Comment by Suganthi Mani [ 30/Apr/20 ]

It also makes me think that we should run the entire onStepDownHook under ShouldNotConflictWithSecondaryBatchApplicationBlock, not just this write. That would prevent any future deadlocks like this one.

I am also thinking of making the stepdown thread first acquire the PBWM lock in some intent mode and then acquire the RSTL in X mode. The problem with this is that we won't be able to step down the node while it is draining, and I am not sure whether having the drain step block stepdown would be acceptable.

Comment by Suganthi Mani [ 30/Apr/20 ]

dianna.hohensee, the proposal sounds good to me too. Here is the basic problem: the stepdown thread currently violates the locking order, and that leads to the deadlock. The locking order is:
1) PBWM lock
2) RSTL lock
3) Global lock
4) DB lock
5) Collection lock

To my understanding, the PBWM doesn't guarantee any exclusive access. The main purpose of the PBWM, I think, is to prevent readers from reading incomplete data while oplog application is in progress. The RSTL should be good enough to guarantee exclusive access.

Comment by Siyuan Zhou [ 30/Apr/20 ]

dianna.hohensee, the proposal sounds good to me. However, if I remember correctly, the truncateAfterPoint is also used on secondaries during oplog application. If stepdown doesn't take the PBWM, how do we guarantee exclusive access? I'm curious about Suganthi's thoughts too.

Comment by Judah Schvimer [ 30/Apr/20 ]

suganthi.mani, I think you're the resident expert on the stepdown locking. Does the above sound good to you?

Comment by Dianna Hohensee (Inactive) [ 30/Apr/20 ]

It sounds like the cycle is OplogApplier -> writers -> stepdown -> OplogApplier. Stepdown wants the PBWM that OplogApplier has.

So I'll break the cycle by having stepdown not take the PBWM. Specifically, the write to clear the oplogTruncateAfterPoint will skip taking the PBWM, using the existing ShouldNotConflictWithSecondaryBatchApplicationBlock RAII type.

Comment by Tess Avitabile (Inactive) [ 16/Apr/20 ]

Thanks, milkie!

Comment by Eric Milkie [ 16/Apr/20 ]

I think this problem actually existed before SERVER-46984. Since the OplogTruncateAfterPoint changes were a big part of the Replicate Before Journaling project, we can take over the diagnosis and fix for this issue – please reassign it to Execution.

Generated at Thu Feb 08 05:14:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.