Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-60859

ReshardingCoordinator waits on _canEnterCritical future without cancellation, potentially preventing config server primary step-up from ever completing

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 5.2.0, 5.0.4, 5.1.0-rc2
    • Affects Version/s: 5.0.0, 5.1.0-rc0
    • Component/s: Sharding
    • None
    • Fully Compatible
    • ALL
    • v5.1, v5.0
    • Sharding 2021-11-01
    • 15
    • 1

      The wait on the future from ReshardingCoordinator::_canEnterCritical becoming ready should be subject to the abortToken. This ensures the ReshardingCoordinator instance can be reliably interrupted by shutdown, stepdown, and the abortReshardCollection command.

      LOGV2(5391602, "Resharding operation waiting for an okay to enter critical section");
      return _canEnterCritical.getFuture().thenRunOn(**executor).then([this] {
          _commitMonitorCancellationSource.cancel();
          LOGV2(5391603, "Resharding operation is okay to enter critical section");
      });
      

      The ReshardingCoordinator never makes it into the "applying" state and ShardedClusterFixture:job0:configsvr:node0 never finishes stepping up because of the ReshardingCoordinator instance still waiting on the future from ReshardingCoordinator::_canEnterCritical becoming ready.

      [j0:c:n0] | 2021-10-20T12:50:18.615+00:00 I  RESHARD  5343001 [ReshardingCoordinatorService-2] "Transitioned resharding coordinator state","attr":{"newState":"applying","oldState":"cloning","namespace":"test2_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"0424b1d8-0f0f-4398-8351-6e5f2af1a8c6"}},"reshardingUUID":{"uuid":{"$uuid":"5a9b5a66-2a4d-4c36-b678-c98990dfec67"}}}
      [j0:c:n0] | 2021-10-20T12:50:18.680+00:00 I  RESHARD  5391602 [ReshardingCoordinatorService-2] "Resharding operation waiting for an okay to enter critical section"
      ...
      [j0:c:n2] | 2021-10-20T12:50:32.453+00:00 I  RESHARD  5343001 [ReshardingCoordinatorService-0] "Transitioned resharding coordinator state","attr":{"newState":"cloning","oldState":"preparing-to-donate","namespace":"test2_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"3b6c95e8-02e9-4c5e-a0ff-dae27dbf1fb4"}},"reshardingUUID":{"uuid":{"$uuid":"ed5a3c6f-6150-4400-a942-d1160f19d810"}}}
      ...
      [j0:c:n2] | 2021-10-20T12:50:37.225+00:00 I  REPL     21358   [ReplCoord-8] "Replica set state transition","attr":{"newState":"SECONDARY","oldState":"PRIMARY"}
      [j0:c:n0] | 2021-10-20T12:50:37.226+00:00 I  REPL     21358   [ReplCoord-2] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}
      ...
      [j0:s1:n2] | 2021-10-20T12:50:38.073+00:00 I  RESHARD  5279506 [ReshardingRecipientService-6] "Transitioned resharding recipient state","attr":{"newState":"applying","oldState":"cloning","namespace":"test2_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"3b6c95e8-02e9-4c5e-a0ff-dae27dbf1fb4"}},"reshardingUUID":{"uuid":{"$uuid":"ed5a3c6f-6150-4400-a942-d1160f19d810"}}}
      [j0:s0:n1] | 2021-10-20T12:50:38.130+00:00 I  RESHARD  5279506 [ReshardingRecipientService-3] "Transitioned resharding recipient state","attr":{"newState":"applying","oldState":"cloning","namespace":"test2_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"3b6c95e8-02e9-4c5e-a0ff-dae27dbf1fb4"}},"reshardingUUID":{"uuid":{"$uuid":"ed5a3c6f-6150-4400-a942-d1160f19d810"}}}
      

      https://logkeeper.mongodb.org/lobster/build/239bc897173c2188aee52be6caf3aa9b/test/61701080f84ae841c03aa68d#bookmarks=0%2C69626%2C152348%2C152373&f~=100~%28Transitioned%20resharding%7CReplica%20set%20state%20transition%7CTransition%20to%20primary%20complete%7CResharding%20operation%20waiting%20for%20an%20okay%20to%20enter%20critical%20section%29&shareLine=69612

      [fsm_workload_test:reshard_collection_crud_ops] sh5755| Thread 45: "OplogApplier-0" (Thread 0x7f5e269a2700 (LWP 24118))
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #0  0x00007f5e4e9ef48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #1  0x00007f5e50950c7c in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x55ae184d4750) at /data/mci/be35cd837a5670147b811bee03592d2f/toolchain-builder/tmp/build-gcc-v3.sh-1f7/build/x86_64-mongodb-linux/libstdc++-v3/include/x86_64-mongodb-linux/bits/gthr-default.h:864
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #2  std::condition_variable::wait (this=this@entry=0x55ae184d4750, __lock=...) at ../../../../../src/combined/libstdc++-v3/src/c++11/condition_variable.cc:53
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #3  0x00007f5e4e0d34c2 in std::_V2::condition_variable_any::wait<mongo::BasicLockableAdapter> (__lock=..., this=0x55ae184d4750) at /opt/mongodbtoolchain/revisions/4ac427ffa2fb12ffce7028023dae1775a06e9bf5/stow/gcc-v3.7vm/include/c++/8.5.0/bits/std_mutex.h:238
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #4  mongo::Interruptible::NotInterruptible::waitForConditionOrInterruptNoAssertUntil (deadline=..., m=..., cv=..., this=<optimized out>) at src/mongo/util/interruptible.h:571
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #5  mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<std::mutex>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1})::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#3}::operator() (this=0x7f5e269a2560, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/interruptible.h:385
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #6  0x00007f5e4d020697 in mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<std::mutex>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1})::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#4}::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at /opt/mongodbtoolchain/revisions/4ac427ffa2fb12ffce7028023dae1775a06e9bf5/stow/gcc-v3.7vm/include/c++/8.5.0/atomic:246
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #7  mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<std::mutex>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<std::mutex>&, mongo::Date_t, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}) (pred=..., finalDeadline=..., m=..., cv=..., this=<optimized out>) at src/mongo/util/interruptible.h:443
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #8  mongo::Interruptible::waitForConditionOrInterrupt<std::unique_lock<std::mutex>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<std::mutex>&, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}) (pred=..., m=..., cv=..., this=<optimized out>) at src/mongo/util/interruptible.h:458
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #9  mongo::future_details::SharedStateBase::wait (interruptible=<optimized out>, this=<optimized out>) at src/mongo/util/future_impl.h:402
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #10 mongo::future_details::SharedStateHolder<mongo::future_details::FakeVoid>::wait (interruptible=<optimized out>, this=<optimized out>) at src/mongo/util/future_impl.h:658
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #11 mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::wait (interruptible=<optimized out>, this=<optimized out>) at src/mongo/util/future_impl.h:830
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #12 mongo::SemiFuture<void>::wait (interruptible=<optimized out>, this=<optimized out>) at src/mongo/util/future.h:189
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #13 mongo::repl::PrimaryOnlyService::ActiveInstance::waitForCompletion (this=<optimized out>) at src/mongo/db/repl/primary_only_service.h:359
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #14 mongo::repl::PrimaryOnlyService::onStepUp (this=0x55ae130bb780, stepUpOpTime=...) at src/mongo/db/repl/primary_only_service.cpp:361
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #15 0x00007f5e4d021860 in mongo::repl::PrimaryOnlyServiceRegistry::onStepUpComplete (this=<optimized out>, opCtx=<optimized out>, term=<optimized out>) at /opt/mongodbtoolchain/revisions/4ac427ffa2fb12ffce7028023dae1775a06e9bf5/stow/gcc-v3.7vm/include/c++/8.5.0/bits/unique_ptr.h:345
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #16 0x00007f5e4cff3e53 in mongo::ReplicaSetAwareServiceRegistry::<lambda(mongo::ReplicaSetAwareInterface*)>::operator() (__closure=<synthetic pointer>, __closure=<synthetic pointer>, service=<optimized out>) at src/mongo/db/repl/replica_set_aware_service.cpp:73
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #17 std::for_each<__gnu_cxx::__normal_iterator<mongo::ReplicaSetAwareInterface**, std::vector<mongo::ReplicaSetAwareInterface*> >, mongo::ReplicaSetAwareServiceRegistry::onStepUpComplete(mongo::OperationContext*, long long int)::<lambda(mongo::ReplicaSetAwareInterface*)> > (__f=..., __last=0x392f002a735c2925, __first=0x55ae13055fa0) at /opt/mongodbtoolchain/revisions/4ac427ffa2fb12ffce7028023dae1775a06e9bf5/stow/gcc-v3.7vm/include/c++/8.5.0/bits/stl_algo.h:3882
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #18 mongo::ReplicaSetAwareServiceRegistry::onStepUpComplete (this=<optimized out>, opCtx=0x55ae18266400, term=17) at src/mongo/db/repl/replica_set_aware_service.cpp:74
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #19 0x00007f5e4cd73181 in mongo::repl::ReplicationCoordinatorImpl::signalDrainComplete (this=0x55ae12e5e000, opCtx=0x55ae18266400, termWhenBufferIsEmpty=17) at src/mongo/db/repl/optime.h:85
      [fsm_workload_test:reshard_collection_crud_ops] sh5755| #20 0x00007f5e4cc24d4c in mongo::repl::OplogApplierImpl::_run (this=0x55ae1878b380, oplogBuffer=<optimized out>) at src/mongo/db/repl/oplog_batcher.h:94
      

      https://logkeeper.mongodb.org/lobster/build/239bc897173c2188aee52be6caf3aa9b/test/61701080f84ae841c03aa68d#bookmarks=0%2C69626%2C152348%2C152373&shareLine=69612

            Assignee:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: