  1. Core Server
  2. SERVER-77115

Investigate Why move_primary_donor_cleaned_up_if_coordinator_steps_up_aborted.js Fails Without Sleep

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Sharding NYC
    • ALL
    • 135

      The move_primary_donor_cleaned_up_if_coordinator_steps_up_aborted.js test frequently times out when the sleep in the test is removed, but is stable when the sleep is present. When it times out, the test appears to hang while joining the movePrimary command, with a stack similar to this:

      Thread 92 (Thread 0x7f7b5e3f2700 (LWP 13015) "conn42"):
      #0  0x00007f7ba2b6b7e1 in poll () from /lib64/libc.so.6
      #1  0x00007f7ba2356152 in poll (__timeout=-1, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
      #2  operator() (__closure=<optimized out>) at src/mongo/transport/asio/asio_networking_baton.cpp:383
      #3  mongo::transport::AsioNetworkingBaton::_poll[abi:cxx11](std::unique_lock<mongo::latch_detail::Mutex>&, mongo::ClockSource*) (this=this@entry=0x5631b3522f10, lk=..., clkSource=clkSource@entry=0x5631ab4848a0) at src/mongo/transport/asio/asio_networking_baton.cpp:390
      #4  0x00007f7ba23575bc in mongo::transport::AsioNetworkingBaton::run (this=0x5631b3522f10, clkSource=0x5631ab4848a0) at src/mongo/transport/asio/asio_networking_baton.cpp:210
      #5  0x00007f7b9a118558 in mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}::operator()() const (__closure=0x7f7b5e3eb030) at src/mongo/util/waitable.h:63
      #6  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (this=0x5631ab50d5c8, notifyable=..., cb=...) at src/mongo/stdx/condition_variable.h:162
      #7  0x00007f7b9a1164d4 in mongo::Waitable::wait<mongo::BasicLockableAdapter> (lk=..., cv=..., clkSource=<optimized out>, waitable=<optimized out>) at src/mongo/util/waitable.h:61
      #8  operator() (__closure=<optimized out>) at src/mongo/db/operation_context.cpp:321
      #9  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=0x5631b3479d40, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:326
      #10 0x00007f7ba129d701 in operator() (__closure=__closure@entry=0x7f7b5e3eb1b0, deadline=..., deadline@entry=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/interruptible.h:307
      #11 0x00007f7ba12a3770 in operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., __closure=<synthetic pointer>) at src/mongo/util/interruptible.h:342
      #12 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Mutex>, mongo::ShardingDDLCoordinatorService::waitForRecoveryCompletion(mongo::OperationContext*) const::<lambda()> > (pred=..., finalDeadline=..., m=..., cv=..., this=0x7f7b5e3eb1b0) at src/mongo/util/interruptible.h:365
      #13 mongo::Interruptible::waitForConditionOrInterrupt<std::unique_lock<mongo::latch_detail::Mutex>, mongo::ShardingDDLCoordinatorService::waitForRecoveryCompletion(mongo::OperationContext*) const::<lambda()> > (pred=..., m=..., cv=..., this=0x7f7b5e3eb1b0) at src/mongo/util/interruptible.h:380
      #14 mongo::ShardingDDLCoordinatorService::waitForRecoveryCompletion (this=this@entry=0x5631ab50d400, opCtx=opCtx@entry=0x5631b3479d40) at src/mongo/db/s/sharding_ddl_coordinator_service.cpp:225
      #15 0x00007f7ba12a38d7 in mongo::ShardingDDLCoordinatorService::getOrCreateInstance (this=0x5631ab50d400, opCtx=opCtx@entry=0x5631b3479d40, coorDoc=owned BSONObj 187 bytes @ 0x5631b3478488 = {...}) at src/mongo/db/s/sharding_ddl_coordinator_service.cpp:262
      #16 0x00007f7ba133cd60 in operator() (__closure=<optimized out>) at src/mongo/db/s/shardsvr_move_primary_command.cpp:98
      #17 operator() (__closure=<optimized out>) at src/mongo/db/s/shardsvr_move_primary_command.cpp:99
      #18 mongo::(anonymous namespace)::ShardsvrMovePrimaryCommand::Invocation::typedRun (this=0x5631b3523200, opCtx=0x5631b3479d40) at src/mongo/db/s/shardsvr_move_primary_command.cpp:102
      #19 0x00007f7b9bbc95d0 in mongo::CommandHelpers::runCommandInvocation (opCtx=0x5631b3479d40, request=..., invocation=0x5631b3523200, response=0x5631b3520460) at src/mongo/db/commands.cpp:186
      #20 0x00007f7b9bbcd27d in operator() (__closure=<optimized out>) at src/mongo/db/request_execution_context.h:69
      #21 mongo::makeReadyFutureWith<mongo::CommandHelpers::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>, bool)::<lambda()> > (func=...) at src/mongo/util/future.h:1348
      #22 mongo::CommandHelpers::runCommandInvocation (rec=std::shared_ptr<mongo::RequestExecutionContext> (empty) = {...}, invocation=std::shared_ptr<mongo::CommandInvocation> (empty) = {...}, useDedicatedThread=<optimized out>) at src/mongo/db/commands.cpp:173
      #23 0x00007f7b96576cf0 in mongo::(anonymous namespace)::runCommandInvocation (rec=std::shared_ptr<mongo::RequestExecutionContext> (empty) = {...}, invocation=std::shared_ptr<mongo::CommandInvocation> (empty) = {...}) at src/mongo/db/service_entry_point_common.cpp:161
      #24 0x00007f7b965791c6 in operator() (__closure=<optimized out>) at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v4.spX/include/c++/11.3.0/bits/shared_ptr_base.h:731
      #25 mongo::makeReadyFutureWith<mongo::(anonymous namespace)::InvokeCommand::run()::<lambda()> > (func=...) at src/mongo/util/future.h:1351
      #26 mongo::(anonymous namespace)::InvokeCommand::run (this=this@entry=0x5631b3360770) at src/mongo/db/service_entry_point_common.cpp:873
      #27 0x00007f7b96585536 in operator()<mongo::(anonymous namespace)::InvokeCommand> (path=0x5631b3360770, __closure=<optimized out>) at src/mongo/db/service_entry_point_common.cpp:1298
      #28 operator() (__closure=<optimized out>) at src/mongo/util/future_util.h:837
      #29 mongo::makeReadyFutureWith<mongo::future_util::AsyncState<mongo::(anonymous namespace)::InvokeCommand>::thenWithState<mongo::(anonymous namespace)::RunCommandImpl::_runCommand()::<lambda(auto:120*)> >(mongo::(anonymous namespace)::RunCommandImpl::_runCommand()::<lambda(auto:120*)>&&) &&::<lambda()> > (func=...) at src/mongo/util/future.h:1351
      #30 mongo::future_util::AsyncState<mongo::(anonymous namespace)::InvokeCommand>::thenWithState<mongo::(anonymous namespace)::RunCommandImpl::_runCommand()::<lambda(auto:120*)> > (launcher=..., this=<optimized out>) at src/mongo/util/future_util.h:842
      #31 mongo::(anonymous namespace)::RunCommandImpl::_runCommand (this=this@entry=0x5631b1521400) at src/mongo/db/service_entry_point_common.cpp:1298
      #32 0x00007f7b9658706a in mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern::_runCommandWithFailPoint (this=this@entry=0x5631b1521400) at src/mongo/db/service_entry_point_common.cpp:1409
      #33 0x00007f7b9658764d in mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern::_runImpl (this=0x5631b1521400) at src/mongo/db/service_entry_point_common.cpp:1329
      #34 0x00007f7b9657af9a in operator() (__closure=<optimized out>) at src/mongo/db/service_entry_point_common.cpp:759
      #35 mongo::makeReadyFutureWith<mongo::(anonymous namespace)::RunCommandImpl::run()::<lambda()> > (func=...) at src/mongo/util/future.h:1351
      #36 mongo::(anonymous namespace)::RunCommandImpl::run (this=this@entry=0x5631b1521400) at src/mongo/db/service_entry_point_common.cpp:757
      #37 0x00007f7b96581e97 in operator()<mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern> (runner=0x5631b1521400, __closure=<optimized out>) at src/mongo/db/service_entry_point_common.cpp:1800
      #38 operator() (__closure=<optimized out>) at src/mongo/util/future_util.h:837
      #39 mongo::makeReadyFutureWith<mongo::future_util::AsyncState<mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern>::thenWithState<mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda()>::<lambda(auto:121*)> >(mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda()>::<lambda(auto:121*)>&&) &&::<lambda()> > (func=...) at src/mongo/util/future.h:1351
      #40 mongo::future_util::AsyncState<mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern>::thenWithState<mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda()>::<lambda(auto:121*)> > (launcher=..., this=<optimized out>) at src/mongo/util/future_util.h:842
      #41 operator() (__closure=<synthetic pointer>) at src/mongo/db/service_entry_point_common.cpp:1800
      #42 mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec (this=0x5631b154c180) at src/mongo/db/service_entry_point_common.cpp:1807
      #43 0x00007f7b965830d1 in operator() (__closure=__closure@entry=0x7f7b5e3ec4a8, s=Status(StaleDbVersion, "No cached info for the database test_db")) at src/mongo/db/service_entry_point_common.cpp:1831
      #44 0x00007f7b9658210e in mongo::future_details::call<mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda(mongo::Status)>&, mongo::Status> (arg=Status::OK(), func=...) at src/mongo/util/future_impl.h:291
      #45 mongo::future_details::throwingCall<mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda(mongo::Status)>&, mongo::Status> (func=...) at src/mongo/util/future_impl.h:349
      #46 operator() (status=Status::OK(), __closure=0x7f7b5e3ec4a8) at src/mongo/util/future_impl.h:1180
      #47 mongo::future_details::call<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::onError<(mongo::ErrorCodes::Error)249, mongo::CleanupFuturePolicy<false>, mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda(mongo::Status)> >(mongo::CleanupFuturePolicy<false>, mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda(mongo::Status)>&&) &&::<lambda(mongo::Status&&)>&, mongo::Status> (arg=Status::OK(), func=...) at src/mongo/util/future_impl.h:291
      #48 mongo::future_details::throwingCall<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::onError<(mongo::ErrorCodes::Error)249, mongo::CleanupFuturePolicy<false>, mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda(mongo::Status)> >(mongo::CleanupFuturePolicy<false>, mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec()::<lambda(mongo::Status)>&&) &&::<lambda(mongo::Status&&)>&, mongo::Status> (func=...) at src/mongo/util/future_impl.h:349
      #49 operator() (__closure=<optimized out>, __closure=<optimized out>, status=Status::OK()) at src/mongo/util/future_impl.h:1140
      

      The current speculation is that a command can hang in ShardingDDLCoordinatorService::waitForRecoveryCompletion() if a stepdown occurs: the stepdown sets _state to kPaused without signaling _recoveredOrCoordinatorCompletedCV, so the waiting thread is never woken. However, no consistent reproducer has been found, so something else may be going on. Further investigation is necessary.
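
      The suspected failure mode is a classic missed wakeup on a condition variable. The following is a minimal standalone sketch of that pattern, not the server code: FakeService, waitForRecoveryToEnd, runStepdown, and the wait predicate (state != kRecovering) are all hypothetical, chosen only for illustration, and the real predicate in waitForRecoveryCompletion() may differ. A timeout is used so the sketch terminates; the real wait has no deadline, which is consistent with the __timeout=-1 poll in the stack above.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the coordinator service state; not the real class.
enum class State { kRecovering, kPaused, kRecovered };

struct FakeService {
    std::mutex mutex;
    std::condition_variable cv;
    State state = State::kRecovering;
    bool waiterParked = false;  // set by the waiter just before it blocks
};

// Blocks while recovery is in progress. Returns true if a notification woke
// the waiter, false if the timeout expired first. The timeout exists only so
// this sketch terminates; a wait without one would block indefinitely.
bool waitForRecoveryToEnd(FakeService& svc, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(svc.mutex);
    svc.waiterParked = true;
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (svc.state == State::kRecovering) {
        if (svc.cv.wait_until(lk, deadline) == std::cv_status::timeout) {
            // Nobody signaled the CV, even if the state has since changed.
            return false;
        }
    }
    return true;
}

// Simulates a stepdown that moves the state to kPaused while a waiter is
// blocked. Returns whether the waiter was woken by a notification.
bool runStepdown(bool notifyOnPause) {
    FakeService svc;
    bool woken = false;
    std::thread waiter([&] {
        woken = waitForRecoveryToEnd(svc, std::chrono::milliseconds(500));
    });

    // Once waiterParked is visible under the mutex, the waiter has released
    // the mutex inside wait_until, so the state change below happens while
    // it is blocked.
    for (;;) {
        {
            std::lock_guard<std::mutex> lk(svc.mutex);
            if (svc.waiterParked) {
                svc.state = State::kPaused;
                break;
            }
        }
        std::this_thread::yield();
    }

    if (notifyOnPause) {
        svc.cv.notify_all();  // the signal that the speculation says is missing
    }
    waiter.join();
    return woken;
}
```

      Under these assumptions, runStepdown(false) models the suspected bug (barring a spurious wakeup, the waiter only returns when the timeout fires), while runStepdown(true) models a stepdown that signals the condition variable after changing the state. Whether signaling alone would unblock the real command depends on the actual wait predicate, which this sketch does not reproduce.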

            Assignee:
            backlog-server-sharding-nyc [DO NOT USE] Backlog - Sharding NYC
            Reporter:
            brett.nawrocki@mongodb.com Brett Nawrocki
            Votes:
            0
            Watchers:
            1

              Created:
              Updated:
              Resolved: