[SERVER-56638] Fix flushReshardingStateChanges critical section race Created: 04/May/21  Updated: 29/Oct/23  Resolved: 21/May/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.0-rc0, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Haley Connelly Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle, post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
is duplicated by SERVER-55852 Shards first acquire LockManager lock... Closed
Related
related to SERVER-57953 _flushReshardingStateChange attempts ... Closed
is related to SERVER-56612 Use the resharding-specific refresh f... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Steps To Reproduce:

resmoke --suites=sharding jstests/sharding/resharding_fails_on_nonempty_stash.js

diff --git a/jstests/sharding/resharding_fails_on_nonempty_stash.js b/jstests/sharding/resharding_fails_on_nonempty_stash.js
index b627d3b186..029b3cee53 100644
--- a/jstests/sharding/resharding_fails_on_nonempty_stash.js
+++ b/jstests/sharding/resharding_fails_on_nonempty_stash.js
@@ -34,6 +34,9 @@ const recipient1Conn = new Mongo(topology.shards[recipientShardNames[1]].primary
 const removeRecipientDocumentFailpoint =
     configureFailPoint(recipient1Conn, "removeRecipientDocFailpoint");
 
+const recipient0Conn = new Mongo(topology.shards[recipientShardNames[0]].primary);
+const pauseCriticalSectionFP = configureFailPoint(recipient0Conn, "reshardingPauseCriticalSection");
+
 reshardingTest.withReshardingInBackground(
     {
         newShardKeyPattern: {newKey: 1},
diff --git a/src/mongo/db/s/flush_resharding_state_change_command.cpp b/src/mongo/db/s/flush_resharding_state_change_command.cpp
index 956387907e..8c88411db1 100644
--- a/src/mongo/db/s/flush_resharding_state_change_command.cpp
+++ b/src/mongo/db/s/flush_resharding_state_change_command.cpp
@@ -51,6 +51,8 @@
 #include "mongo/s/request_types/flush_resharding_state_change_gen.h"
 
 namespace mongo {
+
+MONGO_FAIL_POINT_DEFINE(reshardingPauseCriticalSection);
 namespace {
 
 void refreshShardVersion(OperationContext* opCtx, const NamespaceString& nss) {
@@ -94,12 +96,25 @@ void refreshShardVersion(OperationContext* opCtx, const NamespaceString& nss) {
 
             collLock.reset();
             dbLock.reset();
-
             inRecoverOrRefresh->get(opCtx);
         } else {
             collLock.reset();
             dbLock.reset();
-
+            if (MONGO_unlikely(reshardingPauseCriticalSection.shouldFail())) {
+                sleepsecs(2);
+            }
+            {
+                boost::optional<Lock::DBLock> dbLock2;
+                dbLock2.emplace(opCtx, nss.db(), MODE_IS);
+
+                boost::optional<Lock::CollectionLock> collLock2;
+                collLock2.emplace(opCtx, nss, MODE_IS);
+
+                const auto csr = CollectionShardingRuntime::get(opCtx, nss);
+                auto critSec2 =
+                    csr->getCriticalSectionSignal(opCtx, ShardingMigrationCriticalSection::kWrite);
+                invariant(!critSec2);
+            }
             onShardVersionMismatch(opCtx, nss, boost::none);
         }
     }

Sprint: Sharding 2021-05-17, Sharding 2021-05-31
Participants:
Story Points: 2

 Description   

Summary: It's possible for a resharding refresh to hang on a recipient if the recipient acquires the critical section in the window after flushReshardingStateChanges has taken the locks, checked the critical section, and released the locks, but before it calls onShardVersionMismatch (see the sketch after the scenario below).

Scenario of recipient0 causing a hang:

  • recipient1 encounters an error and transitions from steady-state to error

    [js_test:resharding_fails_on_nonempty_stash] d20026| 2021-05-04T00:56:36.146+00:00 I  RESHARD  5279506 [ReshardingRecipientService-2] "Transitioned resharding recipient state","attr":{"newState":"error","oldState":"steady-state","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"23d3b803-bbd6-4766-beea-80d7d50e884a"}},"reshardingUUID":{"uuid":{"$uuid":"43ab79c0-6627-4c12-ac92-a4b315575c00"}}}
    

  • the coordinator transitions to error

    [js_test:resharding_fails_on_nonempty_stash] c20028| 2021-05-04T00:56:36.159+00:00 I  RESHARD  5343001 [ReshardingCoordinatorService-1] "Transitioned resharding coordinator state","attr":{"newState":"error","oldState":"blocking-writes","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"23d3b803-bbd6-4766-beea-80d7d50e884a"}},"reshardingUUID":{"uuid":{"$uuid":"43ab79c0-6627-4c12-ac92-a4b315575c00"}}}
    

  • recipient0 transitions to strict-consistency (entering the critical section) after flushReshardingStateChanges has acquired the locks to check the critical section and released them, but before flushReshardingStateChanges calls onShardVersionMismatch

    [js_test:resharding_fails_on_nonempty_stash] d20024| 2021-05-04T00:56:36.160+00:00 I  RESHARD  5279506 [ReshardingRecipientService-0] "Transitioned resharding recipient state","attr":{"newState":"strict-consistency","oldState":"steady-state","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"23d3b803-bbd6-4766-beea-80d7d50e884a"}},"reshardingUUID":{"uuid":{"$uuid":"43ab79c0-6627-4c12-ac92-a4b315575c00"}}}
    
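The shape of the race, per the summary above, is a critical-section check performed under locks followed by a refresh performed after the locks are released. Below is a minimal illustrative sketch of that flow, not the actual contents of flush_resharding_state_change_command.cpp; the function name refreshShardVersionSketch and the exact branch structure are simplifications, while the CollectionShardingRuntime and onShardVersionMismatch calls mirror the ones visible in the diff above.

// Minimal sketch (not the real server code) of the racy flow described above:
// the critical-section check and the refresh happen under separate lock
// acquisitions, leaving a window between them.
#include <boost/optional.hpp>

#include "mongo/db/concurrency/d_concurrency.h"
#include "mongo/db/s/collection_sharding_runtime.h"
#include "mongo/db/s/shard_filtering_metadata_refresh.h"

namespace mongo {

void refreshShardVersionSketch(OperationContext* opCtx, const NamespaceString& nss) {
    boost::optional<Lock::DBLock> dbLock;
    boost::optional<Lock::CollectionLock> collLock;
    dbLock.emplace(opCtx, nss.db(), MODE_IS);
    collLock.emplace(opCtx, nss, MODE_IS);

    auto* const csr = CollectionShardingRuntime::get(opCtx, nss);
    auto critSec =
        csr->getCriticalSectionSignal(opCtx, ShardingMigrationCriticalSection::kWrite);

    if (critSec) {
        // Critical section already held: drop the locks and wait it out.
        collLock.reset();
        dbLock.reset();
        critSec->get(opCtx);
        return;
    }

    collLock.reset();
    dbLock.reset();
    // <-- Race window: the recipient can transition to strict-consistency and
    //     enter the critical section right here, after the check above but
    //     before the refresh below. The repro patch widens this window with the
    //     reshardingPauseCriticalSection fail point and sleepsecs(2).
    onShardVersionMismatch(opCtx, nss, boost::none);
}

}  // namespace mongo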



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 21/May/21 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-56638 Create explicit command for abort participants in reshardCollection

(cherry picked from commit b7cf7fb4099b72ce64f7ba6904d3701245a7d6b5)
Branch: v5.0
https://github.com/mongodb/mongo/commit/54250992a5ea46715861aa8191d43286c40ad071

Comment by Githook User [ 21/May/21 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-56638 Create explicit command for abort participants in reshardCollection
Branch: master
https://github.com/mongodb/mongo/commit/b7cf7fb4099b72ce64f7ba6904d3701245a7d6b5

Comment by Max Hirschhorn [ 05/May/21 ]

This is a great find! I think that to address both SERVER-56638 and SERVER-55852 we should introduce an explicit _shardsvrAbortReshardCollection command that the resharding coordinator sends to abort the resharding operation (rather than the generic _flushReshardingStateChange). The {_shardsvrAbortReshardCollection: <reshardingUUID>} command would call new DonorStateMachine::abort() and RecipientStateMachine::abort() methods to cancel their _abortSource. In particular, the resharding coordinator actively aborting a resharding operation would not depend in any way on the ability to complete a shard version refresh. A rough sketch of this command's shape follows the list below.

  • The _shardsvrAbortReshardCollection command must check for both a DonorStateMachine and a RecipientStateMachine to call abort() on.
  • It isn't an error if _shardsvrAbortReshardCollection finds neither a DonorStateMachine nor a RecipientStateMachine, because a delayed message may arrive after the donor and/or recipient has already exited by the time the _shardsvrAbortReshardCollection command is processed. Similarly, the abort() method must allow being called multiple times.
  • The _shardsvrAbortReshardCollection command must wait for the donor + recipient to have transitioned to kDone and for that local write to have become majority-committed. (It doesn't need to wait for the donor/recipient to have updated the config.reshardingOperations collection on the config server though.) The reasoning here is two-fold:
    1. Performing a write and waiting for it to be majority-committed after having called the abort() method(s) ensures the resharding coordinator has contacted a current primary.
    2. Step-up will continue to trigger a shard version refresh and may discover the resharding operation has been aborted. However, the race described in SERVER-56638 will continue to exist for that scenario and may lead to the shard version refresh becoming stuck. Requiring that the resharding coordinator wait for the donor/recipient to have persisted its acknowledgment of the abort signal guarantees that the resharding coordinator would be separately sending (and continually retrying) the _shardsvrAbortReshardCollection command in any case where the shard version refresh could get stuck.
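A hedged sketch of what such a command handler could look like, based only on the bullets above: the stand-in state machine interfaces and the helper functions (lookupDonorSketch, lookupRecipientSketch, waitForDoneMajorityCommittedSketch) are hypothetical illustrations, not the implementation that eventually landed for this ticket.

// Hedged sketch only: one possible shape of the proposed
// _shardsvrAbortReshardCollection handler, derived from the bullets above.
// The stand-in interfaces and helper functions are hypothetical; the real
// DonorStateMachine / RecipientStateMachine live in the resharding service code.
#include <memory>

#include <boost/optional.hpp>

#include "mongo/db/operation_context.h"
#include "mongo/util/uuid.h"

namespace mongo {

// Minimal stand-ins so the sketch is self-contained.
struct DonorStateMachineSketch {
    virtual ~DonorStateMachineSketch() = default;
    virtual void abort() = 0;  // per the proposal: cancels the donor's _abortSource
};
struct RecipientStateMachineSketch {
    virtual ~RecipientStateMachineSketch() = default;
    virtual void abort() = 0;  // cancels the recipient's _abortSource
};

// Hypothetical lookup helpers; finding neither machine is not an error, since a
// delayed abort may arrive after the participant has already exited.
boost::optional<std::shared_ptr<DonorStateMachineSketch>> lookupDonorSketch(
    OperationContext* opCtx, const UUID& reshardingUUID);
boost::optional<std::shared_ptr<RecipientStateMachineSketch>> lookupRecipientSketch(
    OperationContext* opCtx, const UUID& reshardingUUID);

// Hypothetical: waits until the local participant state document(s) have reached
// kDone and that write has become majority-committed on this shard.
void waitForDoneMajorityCommittedSketch(OperationContext* opCtx, const UUID& reshardingUUID);

void handleShardsvrAbortReshardCollectionSketch(OperationContext* opCtx,
                                                const UUID& reshardingUUID) {
    // abort() must tolerate being called multiple times (retried/delayed messages).
    if (auto donor = lookupDonorSketch(opCtx, reshardingUUID)) {
        (*donor)->abort();
    }
    if (auto recipient = lookupRecipientSketch(opCtx, reshardingUUID)) {
        (*recipient)->abort();
    }

    // Waiting here both proves the coordinator contacted a current primary and
    // removes any dependence on completing a shard version refresh, closing the
    // stuck-refresh window described in this ticket.
    waitForDoneMajorityCommittedSketch(opCtx, reshardingUUID);
}

}  // namespace mongo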