[SERVER-54020] ShardInvalidatedForTargeting thrown by resharding's getDestinedRecipient() not being retried by mongos Created: 25/Jan/21  Updated: 29/Oct/23  Resolved: 06/Apr/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: PM-234-M2.5, PM-234-T-lifecycle, PM-234-T-oplog-fetch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-81508 Potential double-execution of write s... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Apply the following patch, which delays the donor shard's refresh of the temporary resharding collection, and then run the repro test:

python buildscripts/resmoke.py run --suite=sharding repro_resharding_write_before_temp_ns_refresh.js

diff --git a/repro_resharding_write_before_temp_ns_refresh.js b/repro_resharding_write_before_temp_ns_refresh.js
new file mode 100644
index 0000000000..7012fae165
--- /dev/null
+++ b/repro_resharding_write_before_temp_ns_refresh.js
@@ -0,0 +1,33 @@
+(function() {
+"use strict";
+
+load("jstests/sharding/libs/resharding_test_fixture.js");
+
+const reshardingTest = new ReshardingTest({numDonors: 2, numRecipients: 2, reshardInPlace: true});
+reshardingTest.setup();
+
+const donorShardNames = reshardingTest.donorShardNames;
+const sourceCollection = reshardingTest.createShardedCollection({
+    ns: "reshardingDb.coll",
+    shardKeyPattern: {oldKey: 1},
+    chunks: [
+        {min: {oldKey: MinKey}, max: {oldKey: 10}, shard: donorShardNames[0]},
+        {min: {oldKey: 10}, max: {oldKey: MaxKey}, shard: donorShardNames[1]},
+    ],
+});
+
+const recipientShardNames = reshardingTest.recipientShardNames;
+reshardingTest.withReshardingInBackground(  //
+    {
+        newShardKeyPattern: {newKey: 1},
+        newChunks: [
+            {min: {newKey: MinKey}, max: {newKey: 10}, shard: recipientShardNames[0]},
+            {min: {newKey: 10}, max: {newKey: MaxKey}, shard: recipientShardNames[1]},
+        ],
+    },
+    () => {
+        assert.commandWorked(sourceCollection.insert({_id: 0, oldKey: 5, newKey: 15}));
+    });
+
+reshardingTest.teardown();
+})();
diff --git a/src/mongo/db/s/resharding/resharding_donor_service.cpp b/src/mongo/db/s/resharding/resharding_donor_service.cpp
index 2a7200f38e..6ba5aa0973 100644
--- a/src/mongo/db/s/resharding/resharding_donor_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_donor_service.cpp
@@ -263,6 +263,8 @@ void ReshardingDonorService::DonorStateMachine::
             ->assertNoIndexBuildInProgForCollection(_donorDoc.getExistingUUID());
     }
 
+    sleepsecs(10);
+
     // Recipient shards expect to read from the donor shard's existing sharded collection
     // and the config.cache.chunks collection of the temporary resharding collection using
     // {atClusterTime: <fetchTimestamp>}. Refreshing the temporary resharding collection on

Sprint: Sharding 2021-03-22, Sharding 2021-04-05, Sharding EMEA 2021-05-03
Participants:
Linked BF Score: 39
Story Points: 1

 Description   

For some reason, the ShardInvalidatedForTargeting exception is being propagated back to the client as a write error rather than being automatically retried by mongos.

bool allowLocks = true;
auto tempNssRoutingInfo = Grid::get(opCtx)->catalogCache()->getCollectionRoutingInfo(
    opCtx,
    constructTemporaryReshardingNss(sourceNss.db(), getCollectionUuid(opCtx, sourceNss)),
    allowLocks);
 
uassert(ShardInvalidatedForTargetingInfo(sourceNss),
        "Routing information is not available for the temporary resharding collection.",
        tempNssRoutingInfo.getStatus() != ErrorCodes::StaleShardVersion);
 
uassertStatusOK(tempNssRoutingInfo);
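
The symptom can be illustrated with a toy model. The sketch below is not the server's code: ErrorCode, RoutingError, checkTempNssRoutingInfo() and routeWrite() are hypothetical names invented for illustration. It assumes the router retries only error codes it recognizes as retryable, so an unrecognized ShardInvalidatedForTargeting surfaces to the client, while a recognized one is retried.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical error codes; the names mirror the ticket, the type is invented.
enum class ErrorCode { OK, StaleShardVersion, ShardInvalidatedForTargeting, Other };

// Hypothetical exception carrying an error code.
struct RoutingError : std::runtime_error {
    ErrorCode code;
    RoutingError(ErrorCode c, const std::string& msg) : std::runtime_error(msg), code(c) {}
};

// Toy stand-in for the shard-side check: a stale lookup status is converted into
// ShardInvalidatedForTargeting; any other failure is thrown with its own code.
void checkTempNssRoutingInfo(ErrorCode lookupStatus) {
    if (lookupStatus == ErrorCode::StaleShardVersion) {
        throw RoutingError(
            ErrorCode::ShardInvalidatedForTargeting,
            "Routing information is not available for the temporary resharding collection.");
    }
    if (lookupStatus != ErrorCode::OK) {
        throw RoutingError(lookupStatus, "routing lookup failed");
    }
}

// Toy router: retries only errors listed in `retryable`, up to maxAttempts.
// Returns OK on success, otherwise the code of the last error seen.
ErrorCode routeWrite(const std::function<void(int)>& shardOp,
                     const std::set<ErrorCode>& retryable,
                     int maxAttempts) {
    ErrorCode last = ErrorCode::OK;
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            shardOp(attempt);
            return ErrorCode::OK;
        } catch (const RoutingError& e) {
            last = e.code;
            if (retryable.count(e.code) == 0) {
                return last;  // Not retryable: surfaces to the client as an error.
            }
        }
    }
    return last;
}
```

With a shard operation that is stale only on its first attempt, an empty retryable set returns the error immediately, whereas including ShardInvalidatedForTargeting in the set lets the second attempt succeed.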



 Comments   
Comment by Githook User [ 06/Apr/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-54020: ShardInvalidatedForTargeting thrown by resharding's getDestinedRecipient() not being retried by mongos
Branch: master
https://github.com/mongodb/mongo/commit/3aa71ec3ef14d5354850e905600aa5cda2fcbba3

Comment by Max Hirschhorn [ 05/Feb/21 ]

I chatted with Kal over Zoom about this ticket. I was hoping that waiting for the effects of the refresh to be majority-committed would allow a new replica set shard primary (different from the primary which had performed the original refresh) to read from its config.cache.chunks collection locally, without needing to contact the config server to answer a CatalogCache::getCollectionRoutingInfo() request. However, the CatalogCache won't necessarily have an entry for the requested collection populated already. In that case it will attempt to contact the config server regardless of the allowLocks parameter, and it may fail to complete the refresh before returning from CatalogCache::getCollectionRoutingInfo(). It is therefore necessary to support the retry logic. My earlier comment about introducing a new kPreparingTempNs coordinator state isn't helpful for fully addressing this issue.

Here is where the shard schedules and waits for the collection refresh before propagating the StaleConfig exception to the router. Kal's suggestion is to either reuse StaleConfigInfo, or create a new exception that causes the shard to refresh and also has mongos retry.
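
A minimal sketch of the behavior Kal's suggestion describes, under stated assumptions: StaleTempNssRouting, ToyShard and routeWithRetry() are hypothetical names, not server types. The toy shard refreshes its routing information before rethrowing the error, and the toy router retries when it sees that error.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical exception: the shard's routing info for the temporary
// resharding collection was not available when the write arrived.
struct StaleTempNssRouting : std::runtime_error {
    StaleTempNssRouting() : std::runtime_error("routing info unavailable") {}
};

// Toy shard that tracks whether it has cached the temporary collection's routing info.
struct ToyShard {
    bool tempNssCached = false;
    int refreshes = 0;

    void refreshTempNss() {
        tempNssCached = true;
        ++refreshes;
    }

    // Runs the write. On a stale error, refreshes first and then rethrows so
    // the router sees the error only after the refresh has completed.
    void handleWrite() {
        try {
            if (!tempNssCached) {
                throw StaleTempNssRouting();
            }
        } catch (const StaleTempNssRouting&) {
            refreshTempNss();
            throw;
        }
    }
};

// Toy router: returns the number of attempts needed, or -1 if all attempts failed.
int routeWithRetry(ToyShard& shard, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            shard.handleWrite();
            return attempt;
        } catch (const StaleTempNssRouting&) {
            // Retryable: the shard has already refreshed, so try again.
        }
    }
    return -1;
}
```

In this model the first attempt fails and triggers exactly one refresh, and the retry succeeds. A router limited to a single attempt would still report the failure.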

Comment by Max Hirschhorn [ 25/Jan/21 ]

One thought would be to revert to the approach in https://github.com/mongodb/mongo/commit/9b15b5a07c8e47e9be4f886ce7c6076fd5c66e87#diff-961b971cf89bf81242a7502df83493edaaf7236a84e69371c9a2409ebddcbed6 and instead introduce a new coordinator state to sequence the different collection refreshes.

  • kInitializing = coordinator has written down in config.reshardingOperations that a resharding operation has begun.
  • kPreparingTempNs (new) = coordinator has written down the config.collections and config.chunks entries for the temporary resharding collection.
  • kPreparingToDonate = donor shards should create a DonorStateMachine upon refresh.

Donor shards would be instructed to refresh the temporary resharding collection after the coordinator state is kPreparingTempNs and then instructed to refresh the existing sharded collection after the coordinator state is kPreparingToDonate.
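
The proposed sequencing can be sketched as a mapping from coordinator state to the refresh a donor shard would be told to perform. This only illustrates the proposal as worded above: the state names come from this comment, while the CoordinatorState and RefreshTarget enums and donorRefreshFor() are hypothetical.

```cpp
#include <cassert>
#include <optional>

// Coordinator states named in the proposal; kPreparingTempNs is the new one.
enum class CoordinatorState { kInitializing, kPreparingTempNs, kPreparingToDonate };

// Hypothetical label for which collection a donor shard refreshes.
enum class RefreshTarget { kTemporaryReshardingCollection, kExistingShardedCollection };

// Returns the refresh a donor would be instructed to perform once the coordinator
// has reached `state`, or std::nullopt if no donor refresh is tied to that state.
std::optional<RefreshTarget> donorRefreshFor(CoordinatorState state) {
    switch (state) {
        case CoordinatorState::kInitializing:
            return std::nullopt;
        case CoordinatorState::kPreparingTempNs:
            return RefreshTarget::kTemporaryReshardingCollection;
        case CoordinatorState::kPreparingToDonate:
            return RefreshTarget::kExistingShardedCollection;
    }
    return std::nullopt;  // Unreachable for valid enum values.
}
```

Keying the refresh target on the state keeps the temporary collection's refresh strictly ahead of the existing collection's refresh, which is the ordering the proposal aims for.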

CC blake.oler

Generated at Thu Feb 08 05:32:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.