Core Server / SERVER-45646

If a filtering metadata refresh is scheduled while a node is secondary and there is no primary, and the node then becomes primary, the refresh can deadlock with itself

    • Sharding EMEA
    • ALL
    • Sharding EMEA 2023-01-23, Sharding EMEA 2023-02-06

      Problem
      If a node is secondary both when the refresh is scheduled and when it decides whether to use the primary or secondary refresh logic, it will choose the secondary refresh logic.

      However, if there is no primary when the node goes to tell its primary to refresh, the refresh task will block waiting to find a primary to satisfy the primary-only readPreference.

      If the node then itself becomes primary, the node will end up targeting itself for the force refresh, which blocks behind the active refresh, causing a deadlock. The force refresh ends up timing out after 30 seconds with NetworkInterfaceExceededTimeLimit on the send side.

      One negative aspect of this is that all refreshes on the collection triggered while the deadlock is active stall until the deadlock is resolved by the force refresh timing out.

      On the other hand, after the deadlock is resolved, the node correctly uses the primary refresh logic for the next refresh that is triggered, so the deadlock does not reoccur.
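
      The deadlock therefore has a simple shape: the scheduled refresh holds the collection's active refresh state and, while holding it, synchronously waits on a force refresh that, once this node has become primary, is served by this same node and must wait for that same active refresh. The following minimal, self-contained sketch (a toy model with a shortened timeout, not server code) illustrates that shape; in the real server the cycle is only broken by the 30-second maxTimeMS on the _flushRoutingTableCacheUpdates command.

      // Toy model of the self-deadlock: one thread holds the "active refresh" and
      // synchronously waits on a "force refresh" that can only complete once the
      // active refresh is released, so only the client-side timeout breaks the cycle.
      #include <chrono>
      #include <future>
      #include <iostream>
      #include <mutex>

      std::timed_mutex activeRefreshMutex;  // stands in for the per-collection active refresh

      // Stand-in for the primary-side handling of the force refresh: it must wait
      // for any active refresh to finish before it can return.
      bool handleForceRefreshOnPrimary(std::chrono::milliseconds maxTimeMS) {
          if (activeRefreshMutex.try_lock_for(maxTimeMS)) {
              activeRefreshMutex.unlock();
              return true;  // refresh finished, flush succeeds
          }
          return false;  // analogous to NetworkInterfaceExceededTimeLimit
      }

      int main() {
          // The refresh scheduled while the node was secondary: it holds the active
          // refresh and, mid-flight, sends a force refresh to "the primary" -- which
          // by now is this same node.
          std::lock_guard<std::timed_mutex> activeRefresh(activeRefreshMutex);

          auto forceRefresh = std::async(std::launch::async, [] {
              return handleForceRefreshOnPrimary(std::chrono::seconds(2));  // 30s in the server
          });

          // The refresh now waits on its own force refresh; nothing makes progress
          // until the timeout fires and breaks the deadlock.
          std::cout << (forceRefresh.get() ? "flush succeeded"
                                           : "flush timed out; deadlock broken by timeout")
                    << std::endl;
      }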

      Why it isn't handled properly by the server

      The OperationContext being used for the force refresh is not interrupted when the node becomes primary, since the OperationContext was not used to take a strong lock.
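
      A rough model of why the ops-killer misses this operation (an assumption-level sketch, not the ReplicationCoordinator code): the kill step that runs during a state transition only targets operations that took a strong global lock, and the force refresh is merely waiting on a remote command response.

      // Sketch of the ops-killing filter during a replication state transition: only
      // operations that took a strong (global write) lock get interrupted, so the
      // force refresh -- which is just waiting on a remote response -- is left running.
      #include <iostream>
      #include <string>
      #include <vector>

      struct Operation {
          std::string desc;
          bool tookStrongLock = false;  // e.g. a global MODE_IX/MODE_X lock
          bool interrupted = false;
      };

      // Stand-in for the ops-killing step of step-up/step-down.
      void killOpsForStateTransition(std::vector<Operation>& ops) {
          for (auto& op : ops) {
              if (op.tookStrongLock) {
                  op.interrupted = true;  // these fail with a state-change interruption
              }
              // Operations without strong locks, like the pending force refresh, keep waiting.
          }
      }

      int main() {
          std::vector<Operation> ops = {
              {"user write holding the global IX lock", true},
              {"force refresh waiting on a primary-only remote command", false},
          };
          killOpsForStateTransition(ops);
          for (const auto& op : ops)
              std::cout << op.desc << " -> " << (op.interrupted ? "interrupted" : "still running") << "\n";
      }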

      Ways this can be seen in the wild

      This can happen if secondary reads are being used, since secondary reads can routinely trigger filtering metadata refreshes while a node is secondary.

      However, it can also happen even if secondary reads are not being used. For example:

      • A node is primary when it receives a primary-only request
      • The node steps down before the request has taken a strong lock, so the request is not interrupted
      • The request schedules a filtering metadata refresh while secondary, and while there is no primary
      • The node itself transitions back to primary

      This latter case is more likely to occur if a user is inducing a failover on the shard, particularly if they are using replication priorities to try to ensure a sticky primary.
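
      To make the role-at-schedule-time decision concrete, here is a small stand-alone model of the loader's role/term bookkeeping (the method names mirror ShardServerCatalogCacheLoader's onStepDown/onStepUp, but the class is a simplification, not the real loader). Replaying the sequence above shows that the refresh path is chosen from the role observed when the refresh is scheduled, not from the role at the time the primary-only command is finally sent.

      // Simplified model: the refresh path is captured under the mutex when the
      // refresh is scheduled, so a later step-up does not change it.
      #include <iostream>
      #include <mutex>
      #include <string>

      class LoaderModel {
      public:
          void onStepDown() {
              std::lock_guard<std::mutex> lk(_mutex);
              ++_term;
              _isPrimary = false;
          }
          void onStepUp() {
              std::lock_guard<std::mutex> lk(_mutex);
              ++_term;
              _isPrimary = true;
          }
          // Mirrors the decision point in getChunksSince: role (and term) are read
          // once, when the refresh is scheduled.
          std::string scheduleRefresh() {
              std::lock_guard<std::mutex> lk(_mutex);
              return _isPrimary ? "primary path: refresh locally"
                                : "secondary path: send primary-only force refresh";
          }

      private:
          std::mutex _mutex;
          bool _isPrimary = true;  // the node starts out primary in the scenario above
          long long _term = 1;
      };

      int main() {
          LoaderModel loader;
          loader.onStepDown();                   // node steps down; there is no primary yet
          auto path = loader.scheduleRefresh();  // refresh scheduled while secondary
          loader.onStepUp();                     // node wins the election and becomes primary
          // The already-scheduled refresh still follows the secondary path, so the
          // primary-only command now targets this same node.
          std::cout << path << std::endl;
      }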

      I can reproduce this latter case by applying the following diff (which includes some extra logging to see what's happening) and running the test jstests/sharding/migration_coordinator_failover.js:

      diff --git a/jstests/sharding/migration_coordinator_failover.js b/jstests/sharding/migration_coordinator_failover.js
      index 4418ea7589..4d14000639 100644
      --- a/jstests/sharding/migration_coordinator_failover.js
      +++ b/jstests/sharding/migration_coordinator_failover.js
      @@ -25,7 +25,7 @@ function getNewNs(dbName) {
       
       const dbName = "test";
       
      -var st = new ShardingTest({shards: 2, rs: {nodes: 2}});
      +var st = new ShardingTest({shards: 2, rs: {nodes: [{}, {rsConfig: {priority: 0}}]}});
       
       assert.commandWorked(st.s.adminCommand({enableSharding: dbName}));
       assert.commandWorked(st.s.adminCommand({movePrimary: dbName, to: st.shard0.shardName}));
      @@ -76,6 +76,11 @@ function runMoveChunkMakeDonorStepDownAfterFailpoint(
           failpointHandle.wait();
       
           jsTest.log("Make the donor primary step down.");
      +    // Note: stepUpNoAwaitReplication waits for the old primary to see the new primary, which means
      +    // it waits for the old primary to have stepped down. Making the secondary step up makes the
      +    // test run faster than making the primary step down, since it avoids waiting for the election
      +    // timeout (10 seconds) for the replica set to run an election.
      +    //st.rs0.stepUpNoAwaitReplication(st.rs0.getSecondary());
           assert.commandWorked(
               st.rs0.getPrimary().adminCommand({replSetStepDown: 10 /* stepDownSecs */, force: true}));
           failpointHandle.off();
      @@ -122,48 +127,48 @@ function runMoveChunkMakeDonorStepDownAfterFailpoint(
       // Decision is commit
       //
       
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeMakingCommitDecisionDurable",
      -                                            false /* shouldMakeMigrationFailToCommitOnConfig */);
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeSendingCommitDecision",
      -                                            false /* shouldMakeMigrationFailToCommitOnConfig */);
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeForgettingMigrationAfterCommitDecision",
      -                                            false /* shouldMakeMigrationFailToCommitOnConfig */);
      -
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeMakingCommitDecisionDurable",
      +//                                            false /* shouldMakeMigrationFailToCommitOnConfig */);
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeSendingCommitDecision",
      +//                                            false /* shouldMakeMigrationFailToCommitOnConfig */);
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeForgettingMigrationAfterCommitDecision",
      +//                                            false /* shouldMakeMigrationFailToCommitOnConfig */);
       //
      -// Decision is abort
      +////
      +//// Decision is abort
      +////
       //
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("moveChunkHangAtStep3",
      +//                                            false /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.OperationFailed);
      +//
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("moveChunkHangAtStep4",
      +//                                            false /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.OperationFailed);
       
      -runMoveChunkMakeDonorStepDownAfterFailpoint("moveChunkHangAtStep3",
      -                                            false /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.OperationFailed);
      -
      -runMoveChunkMakeDonorStepDownAfterFailpoint("moveChunkHangAtStep4",
      -                                            false /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.OperationFailed);
      -
      -runMoveChunkMakeDonorStepDownAfterFailpoint("moveChunkHangAtStep5",
      -                                            false /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.OperationFailed);
      -
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangInEnsureChunkVersionIsGreaterThanThenThrow",
      -                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.OperationFailed);
      -
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("moveChunkHangAtStep5",
      +//                                            false /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.OperationFailed);
      +//
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangInEnsureChunkVersionIsGreaterThanThenThrow",
      +//                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.OperationFailed);
      +//
       runMoveChunkMakeDonorStepDownAfterFailpoint("hangInRefreshFilteringMetadataUntilSuccessThenThrow",
                                                   true /* shouldMakeMigrationFailToCommitOnConfig */,
                                                   ErrorCodes.OperationFailed);
      -
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeMakingAbortDecisionDurable",
      -                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.StaleEpoch);
      -
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeSendingAbortDecision",
      -                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.StaleEpoch);
      -
      -runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeForgettingMigrationAfterAbortDecision",
      -                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      -                                            ErrorCodes.StaleEpoch);
      -
      +//
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeMakingAbortDecisionDurable",
      +//                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.StaleEpoch);
      +//
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeSendingAbortDecision",
      +//                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.StaleEpoch);
      +//
      +//runMoveChunkMakeDonorStepDownAfterFailpoint("hangBeforeForgettingMigrationAfterAbortDecision",
      +//                                            true /* shouldMakeMigrationFailToCommitOnConfig */,
      +//                                            ErrorCodes.StaleEpoch);
      +//
       st.stop();
       })();
      diff --git a/src/mongo/db/repl/replication_coordinator_impl.cpp b/src/mongo/db/repl/replication_coordinator_impl.cpp
      index 5931e3f4f7..519d0a9672 100644
      --- a/src/mongo/db/repl/replication_coordinator_impl.cpp
      +++ b/src/mongo/db/repl/replication_coordinator_impl.cpp
      @@ -2161,6 +2161,9 @@ void ReplicationCoordinatorImpl::stepDown(OperationContext* opCtx,
           auto deadline = force ? stepDownUntil : waitUntil;
           AutoGetRstlForStepUpStepDown arsd(
               this, opCtx, ReplicationCoordinator::OpsKillingStateTransitionEnum::kStepDown, deadline);
      +    LOG(0) << "xxx about to sleep after killing ops";
      +    sleepmillis(1000);
      +    LOG(0) << "xxx done sleeping after killing ops";
       
           stdx::unique_lock<Latch> lk(_mutex);
       
      diff --git a/src/mongo/db/s/migration_util.cpp b/src/mongo/db/s/migration_util.cpp
      index 18c0647206..d904e5ed5d 100644
      --- a/src/mongo/db/s/migration_util.cpp
      +++ b/src/mongo/db/s/migration_util.cpp
      @@ -524,6 +524,10 @@ void refreshFilteringMetadataUntilSuccess(OperationContext* opCtx, const Namespa
                   auto newOpCtxPtr = cc().makeOperationContext();
                   auto newOpCtx = newOpCtxPtr.get();
       
      +            LOG(0) << "xxx about to sleep before forcing refresh";
      +            sleepmillis(2000);
      +            LOG(0) << "xxx done sleeping before forcing refresh";
      +
                   forceShardFilteringMetadataRefresh(newOpCtx, nss, true);
       
                   // 'newOpCtx' won't get interrupted if a stepdown occurs while the thread is hanging in
      @@ -531,12 +535,16 @@ void refreshFilteringMetadataUntilSuccess(OperationContext* opCtx, const Namespa
                   // MODE_X lock. To ensure the catch block is entered if the failpoint was set, throw an
                   // arbitrary error.
                   if (hangInRefreshFilteringMetadataUntilSuccessThenThrow.shouldFail()) {
      +                LOG(0) << "xxx failpoint is on";
                       hangInRefreshFilteringMetadataUntilSuccessThenThrow.pauseWhileSet(newOpCtx);
                       uasserted(ErrorCodes::InternalError,
                                 "simulate an error response for forceShardFilteringMetadataRefresh");
                   }
      +            LOG(0) << "xxx breaking out of loop";
                   break;
               } catch (const DBException& ex) {
      +            LOG(0) << "xxx caught exception: " << ex.toStatus();
      +
                   // If the server is already doing a clean shutdown, join the shutdown.
                   if (globalInShutdownDeprecated()) {
                       shutdown(waitForShutdown());
      diff --git a/src/mongo/db/s/shard_server_catalog_cache_loader.cpp b/src/mongo/db/s/shard_server_catalog_cache_loader.cpp
      index 682954e7b4..335c18c301 100644
      --- a/src/mongo/db/s/shard_server_catalog_cache_loader.cpp
      +++ b/src/mongo/db/s/shard_server_catalog_cache_loader.cpp
      @@ -379,8 +379,10 @@ void ShardServerCatalogCacheLoader::initializeReplicaSetRole(bool isPrimary) {
           invariant(_role == ReplicaSetRole::None);
       
           if (isPrimary) {
      +        LOG(0) << "xxx initializing role to primary";
               _role = ReplicaSetRole::Primary;
           } else {
      +        LOG(0) << "xxx initializing role to secondary";
               _role = ReplicaSetRole::Secondary;
           }
       }
      @@ -390,6 +392,7 @@ void ShardServerCatalogCacheLoader::onStepDown() {
           invariant(_role != ReplicaSetRole::None);
           _contexts.interrupt(ErrorCodes::PrimarySteppedDown);
           ++_term;
      +    LOG(0) << "xxx setting role to secondary, new term is " << _term;
           _role = ReplicaSetRole::Secondary;
       }
       
      @@ -397,6 +400,7 @@ void ShardServerCatalogCacheLoader::onStepUp() {
           stdx::lock_guard<Latch> lg(_mutex);
           invariant(_role != ReplicaSetRole::None);
           ++_term;
      +    LOG(0) << "xxx setting role to primary, new term is " << _term;
           _role = ReplicaSetRole::Primary;
       }
       
      @@ -432,6 +436,7 @@ std::shared_ptr<Notification<void>> ShardServerCatalogCacheLoader::getChunksSinc
           long long term;
           std::tie(isPrimary, term) = [&] {
               stdx::lock_guard<Latch> lock(_mutex);
      +        LOG(0) << "xxx in getChunksSince, _role is " << (_role == ReplicaSetRole::Primary ? "primary" : "secondary") << ", _term is " << _term;
               return std::make_tuple(_role == ReplicaSetRole::Primary, _term);
           }();
       
      @@ -448,6 +453,7 @@ std::shared_ptr<Notification<void>> ShardServerCatalogCacheLoader::getChunksSinc
                           // began but before the OperationContext was added to the group. So we'll check
                           // that we're still in the same _term.
                           stdx::lock_guard<Latch> lock(_mutex);
      +                    LOG(0) << "xxx in task for getChunksSince, term is now " << _term;
                           uassert(ErrorCodes::InterruptedDueToReplStateChange,
                                   "Unable to refresh routing table because replica set state changed or "
                                   "the node is shutting down.",
      

      From the test output:

      // Refresh is scheduled while node is secondary
      d15520| 2020-02-12T18:08:21.379-0500 I  SHARDING [conn24] xxx in getChunksSince, _role is secondary, _term is 1
      d15520| 2020-02-12T18:08:21.379-0500 I  SHARDING [ShardServerCatalogCacheLoader-1] xxx in task for getChunksSince, term is now 1
      
      // Node cannot find primary
      d15520| 2020-02-12T18:08:21.381-0500 W  NETWORK  [ReplicaSetMonitor-TaskExecutor] Unable to reach primary for set migration_coordinator_failover-rs0
      
      // Node itself runs election and is elected primary
      d15520| 2020-02-12T18:08:30.714-0500 I  ELECTION [ReplCoord-1] Starting an election, since we've seen no PRIMARY in the past 10000ms
      d15520| 2020-02-12T18:08:30.749-0500 I  ELECTION [ReplCoord-1] election succeeded, assuming primary role in term 2
      d15520| 2020-02-12T18:08:30.749-0500 I  REPL     [ReplCoord-1] transition to PRIMARY from SECONDARY
      
      // Node's ShardServerCatalogCacheLoader's role and term are bumped as part of the stepup
      d15520| 2020-02-12T18:08:31.403-0500 I  SHARDING [OplogApplier-0] xxx setting role to primary, new term is 2
      
      // The refresh that was scheduled while the node was secondary errors with NetworkInterfaceExceededTimeLimit on a force refresh to itself
      d15520| 2020-02-12T18:09:01.882-0500 I  NETWORK  [TransactionCoordinator] Marking host eshamaharishi-X10DAi:15520 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Request 177 timed out, deadline was 2020-02-12T18:09:01.882-0500, op was RemoteCommand 177 -- target:[eshamaharishi-X10DAi:15520] db:admin expDate:2020-02-12T18:09:01.882-0500 cmd:{ _flushRoutingTableCacheUpdates: "test.ns1", maxTimeMS: 30000 }
      

            Assignee: Sergi Mateo Bellido (sergi.mateo-bellido@mongodb.com)
            Reporter: Esha Maharishi (esha.maharishi@mongodb.com) (Inactive)
            Votes: 0
            Watchers: 12
