Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-47582

Stepdown in drain mode can trigger deadlock due to clearing OplogTruncateAfterPoint

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.0-rc5, 4.7.0
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Execution Team 2020-05-04, Execution Team 2020-05-18

      Description

      Found the deadlock when testing a behavior in drain mode. Stepdown command hangs. Attached are the log and waiting graph.

      • OplogApplier-0 thread is waiting for all writers to finish while holding the PBWM lock.
      • Writers are waiting on the RSTL held by the stepdown thread.
      • Stepdown thread is waiting on the PBWM lock in order to clear OplogTruncateAfterPoint, while holding the RSTL in X mode.

      Here's the stepdown thread's stacktrace.

      Thread 66: "conn20" (Thread 0x7f1731b4d700 (LWP 29264))
      #0  0x00007f176410d403 in poll () from /lib64/libc.so.6
      #1  0x00007f1769a477e7 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x7f1773a25a90, clkSource=0x7f176d570a90) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/stl_vector.h:805
      #2  0x00007f1769a44add in mongo::transport::TransportLayerASIO::BatonASIO::run_until (this=this@entry=0x7f1773a25a90, clkSource=0x7f176d570a90, deadline=...) at src/mongo/transport/baton_asio_linux.h:247
      #3  0x00007f1769f4da21 in mongo::Waitable::wait_until<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57
      #4  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait_until<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=..., this=0x7f17739e5178) at src/mongo/stdx/condition_variable.h:162
      #5  mongo::Waitable::wait_until<mongo::BasicLockableAdapter> (timeout_time=..., lk=..., cv=..., clkSource=<optimized out>, waitable=0x7f1773a25a90) at src/mongo/util/waitable.h:92
      #6  mongo::ClockSource::waitForConditionUntil (this=<optimized out>, cv=..., bla=..., deadline=..., deadline@entry=..., waitable=0x7f1773a25a90) at src/mongo/util/clock_source.cpp:48
      #7  0x00007f1769f43600 in mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/util/lockable_adapter.h:50
      #8  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=0x7f1773514380, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:309
      #9  0x00007f1769ef8792 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator()(mongo::Date_t, mongo::Interruptible::WakeSpeed) const (this=0x7f1773514380, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50
      #10 0x00007f1769ef8ccf in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at src/mongo/db/concurrency/lock_state.cpp:246
      #11 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (waitTimer=0x0, pred=..., finalDeadline=..., m=..., cv=..., this=0x7f1773514380) at src/mongo/util/interruptible.h:448
      #12 mongo::Interruptible::waitForConditionOrInterruptFor<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (waitTimer=0x0, pred=..., ms=..., m=..., cv=..., this=0x7f1773514380) at src/mongo/util/interruptible.h:480
      #13 mongo::CondVarLockGrantNotification::wait (this=0x7f17739e5128, opCtx=0x7f1773514380, timeout=...) at src/mongo/db/concurrency/lock_state.cpp:245
      #14 0x00007f1769efa9e6 in mongo::LockerImpl::_lockComplete (this=0x7f17739e5080, opCtx=0x7f1773514380, resId=..., mode=mongo::MODE_IS, deadline=...) at src/mongo/db/concurrency/lock_state.cpp:945
      #15 0x00007f1769eeea4b in mongo::Lock::ResourceLock::lock (this=0x7f1731b491b0, opCtx=<optimized out>, mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/d_concurrency.cpp:317
      #16 0x00007f1769eeebd3 in mongo::Lock::GlobalLock::GlobalLock (this=0x7f1731b491a0, opCtx=0x7f1773514380, lockMode=mongo::MODE_IX, deadline=..., behavior=<optimized out>) at src/mongo/db/concurrency/d_concurrency.cpp:152
      #17 0x00007f1769eeecbb in mongo::Lock::DBLock::DBLock (this=0x7f1731b49188, opCtx=0x7f1773514380, db="local", mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/lock_manager_defs.h:106
      #18 0x00007f1768e0785f in mongo::AutoGetDb::AutoGetDb (this=0x7f1731b49160, opCtx=0x7f1773514380, dbName="local", mode=mongo::MODE_IX, deadline=...) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/ext/new_allocator.h:86
      #19 0x00007f1768e07cfe in mongo::AutoGetCollection::AutoGetCollection (this=0x7f1731b49160, opCtx=0x7f1773514380, nsOrUUID=..., modeColl=mongo::MODE_IX, viewMode=mongo::AutoGetCollection::kViewsForbidden, deadline=...) at src/mongo/base/string_data.h:64
      #20 0x00007f1768314b4e in mongo::repl::StorageInterfaceImpl::<lambda()>::operator()(void) const (__closure=0x7f1731b493d0) at src/mongo/db/repl/storage_interface_impl.cpp:930
      #21 0x00007f17683153cd in mongo::writeConflictRetry<mongo::repl::StorageInterfaceImpl::upsertById(mongo::OperationContext*, const mongo::NamespaceStringOrUUID&, const mongo::BSONElement&, const mongo::BSONObj&)::<lambda()> > (f=..., ns=..., opStr=..., opCtx=0x7f1773514380) at src/mongo/util/fail_point.h:235
      #22 mongo::repl::StorageInterfaceImpl::upsertById (this=<optimized out>, opCtx=<optimized out>, nsOrUUID=..., idKey=..., update=...) at src/mongo/db/repl/storage_interface_impl.cpp:974
      #23 0x00007f176846fc8b in mongo::repl::ReplicationConsistencyMarkersImpl::_upsertOplogTruncateAfterPointDocument (this=<optimized out>, opCtx=0x7f1773514380, updateSpec=owned BSONObj 49 bytes @ 0x7f1773604598) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/char_traits.h:285
      #24 0x00007f176846ff17 in mongo::repl::ReplicationConsistencyMarkersImpl::setOplogTruncateAfterPoint (this=0x7f176db86b80, opCtx=0x7f1773514380, timestamp=...) at src/mongo/base/string_data.h:74
      #25 0x00007f176832d304 in mongo::repl::ReplicationCoordinatorExternalStateImpl::_stopAsyncUpdatesOfAndClearOplogTruncateAfterPoint (this=0x7f176db98600) at src/mongo/bson/timestamp.h:76
      #26 0x00007f176835d5c8 in mongo::repl::ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction (this=0x7f176dc0a800, action=action@entry=mongo::repl::ReplicationCoordinatorImpl::kActionSteppedDown) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/unique_ptr.h:342
      #27 0x00007f176835f91a in mongo::repl::ReplicationCoordinatorImpl::<lambda()>::operator()(void) const (__closure=0x7f1731b499d0) at src/mongo/db/repl/replication_coordinator_impl.cpp:2582
      #28 0x00007f17683711f0 in mongo::repl::ReplicationCoordinatorImpl::stepDown (this=0x7f176dc0a800, opCtx=<optimized out>, force=<optimized out>, waitTime=..., stepdownTime=...) at src/mongo/db/repl/replication_coordinator_impl.cpp:2672
      #29 0x00007f1768325bc0 in mongo::repl::CmdReplSetStepDown::run (this=<optimized out>, opCtx=0x7f1773514380, cmdObj=..., result=...) at src/mongo/util/duration.h:241
      #30 0x00007f1768fa904a in mongo::BasicCommand::runWithReplyBuilder (replyBuilder=0x7f1773603c30, cmdObj=owned BSONObj 242 bytes @ 0x7f1772eca91d, db="admin", opCtx=0x7f1773514380, this=0x7f176b347c40 <mongo::repl::cmdReplSetStepDown>) at src/mongo/db/commands.h:799
      #31 mongo::BasicCommandWithReplyBuilderInterface::Invocation::run (this=0x7f17739e1a20, opCtx=0x7f1773514380, result=0x7f1773603c30) at src/mongo/db/commands.cpp:764
      #32 0x00007f1768fa2f4f in mongo::CommandHelpers::runCommandInvocation (opCtx=0x7f1773514380, request=..., invocation=0x7f17739e1a20, response=0x7f1773603c30) at src/mongo/db/commands.cpp:184
      #33 0x00007f176872be15 in mongo::(anonymous namespace)::runCommandImpl (opCtx=<optimized out>, invocation=<optimized out>, request=..., replyBuilder=0x7f1773603c30, startOperationTime=..., behaviors=..., extraFieldsBuilder=0x7f1731b4a3b0, sessionOptions=...) at src/mongo/db/service_entry_point_common.cpp:851
      #34 0x00007f176872ee19 in mongo::(anonymous namespace)::execCommandDatabase (opCtx=<optimized out>, command=<optimized out>, request=..., replyBuilder=<optimized out>, behaviors=...) at src/mongo/db/service_entry_point_common.cpp:1177
      

       

      It's unclear why this isn't caught by existing stepdown in drain mode tests, like jstests/replsets/step_down_during_draining.js and its friends, probably because the burn-in tests in my patch build runs much more times than master branch. My new test timed out 3 time in burn-in tests. The logic was introduced in SERVER-46984 on 3/20/20. SERVER-46201 refactored the code after that.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              dianna.hohensee Dianna Hohensee
              Reporter:
              siyuan.zhou Siyuan Zhou
              Participants:
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: