[SERVER-53539] TypeCollectionReshardingFields are incorrect following a shard version refresh Created: 30/Dec/20  Updated: 29/Oct/23  Resolved: 18/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: PM-234-Catalog-Work, PM-234-M2, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-54874 Ensure reading of consistent config.c... Closed
depends on SERVER-55146 Bump collection version on any modifi... Closed
Related
related to SERVER-55307 Complete TODO listed in SERVER-53539 Closed
is related to SERVER-52620 Update resharding_replicate_updates_a... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Both shards d20020 and d20022 see the collection version as 2|1||5fec046614ff529dbac7fa05. However, only shard d20022 correctly sees the coordinator state as "cloning"; shard d20020 incorrectly sees it as "preparing-to-donate". This causes shard d20020 to skip constructing a RecipientStateMachine, while the coordinator (config server) is led to believe shard d20020 has finished refreshing. The resharding operation is then left unable to make further progress.

This issue appears to happen only (and only very rarely) when the temporary resharding collection is queried via mongos by the test client. I wonder whether there is still another issue along the lines of SERVER-51510 in ShardServerCatalogCacheLoader::_getLoaderMetadata().

[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.423+0000 d20020| 2020-12-30T04:39:02.423+00:00 I  SH_REFR  4619901 [CatalogCache-0] "Refreshed cached collection","attr":{"namespace":"test.system.resharding.0fe4b9ee-41d2-4855-8411-32539bc84657","newVersion":{"chunkVersion":{"0":{"$timestamp":{"t":2,"i":1}},"1":{"$oid":"5fec046614ff529dbac7fa05"}},"forcedRefreshSequenceNum":1,"epochDisambiguatingSequenceNum":7},"oldVersion":{"chunkVersion":"None","forcedRefreshSequenceNum":0,"epochDisambiguatingSequenceNum":0},"durationMillis":2}
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.423+0000 d20020| | 2020-12-30T04:39:02.423+00:00 I  SHARDING 5262000 [RecoverRefreshThread] "Ignoring shard version change","attr":{"reshardingFields":{"uuid":{"$uuid":"6e22c09a-3051-43d4-861e-06f9629abb7a"},"state":"preparing-to-donate","recipientFields":{"donorShardIds":["shard0","shard1"],"existingUUID":{"$uuid":"0fe4b9ee-41d2-4855-8411-32539bc84657"},"originalNamespace":"test.foo"}},"collectionMetadata":"collection version: 2|1||5fec046614ff529dbac7fa05, shard version: 2|0||5fec046614ff529dbac7fa05"}
...
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.429+0000 d20022| 2020-12-30T04:39:02.426+00:00 I  SH_REFR  4619901 [CatalogCache-0] "Refreshed cached collection","attr":{"namespace":"test.system.resharding.0fe4b9ee-41d2-4855-8411-32539bc84657","newVersion":{"chunkVersion":{"0":{"$timestamp":{"t":2,"i":1}},"1":{"$oid":"5fec046614ff529dbac7fa05"}},"forcedRefreshSequenceNum":1,"epochDisambiguatingSequenceNum":6},"oldVersion":{"chunkVersion":"None","forcedRefreshSequenceNum":0,"epochDisambiguatingSequenceNum":0},"durationMillis":3}
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.429+0000 d20022| | 2020-12-30T04:39:02.426+00:00 I  SHARDING 5262001 [RecoverRefreshThread] "Creating recipient state machine","attr":{"reshardingFields":{"uuid":{"$uuid":"6e22c09a-3051-43d4-861e-06f9629abb7a"},"state":"cloning","recipientFields":{"fetchTimestamp":{"$timestamp":{"t":1609303142,"i":51}},"donorShardIds":["shard0","shard1"],"existingUUID":{"$uuid":"0fe4b9ee-41d2-4855-8411-32539bc84657"},"originalNamespace":"test.foo"}},"collectionMetadata":"collection version: 2|1||5fec046614ff529dbac7fa05, shard version: 2|1||5fec046614ff529dbac7fa05"}
...
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.431+0000 d20020| 2020-12-30T04:39:02.431+00:00 I  SH_REFR  4619901 [CatalogCache-0] "Refreshed cached collection","attr":{"namespace":"test.system.resharding.0fe4b9ee-41d2-4855-8411-32539bc84657","newVersion":{"chunkVersion":{"0":{"$timestamp":{"t":2,"i":1}},"1":{"$oid":"5fec046614ff529dbac7fa05"}},"forcedRefreshSequenceNum":1,"epochDisambiguatingSequenceNum":8},"oldVersion":{"chunkVersion":"None","forcedRefreshSequenceNum":0,"epochDisambiguatingSequenceNum":0},"durationMillis":3}
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.431+0000 d20020| | 2020-12-30T04:39:02.431+00:00 I  SHARDING 5262000 [RecoverRefreshThread] "Ignoring shard version change","attr":{"reshardingFields":{"uuid":{"$uuid":"6e22c09a-3051-43d4-861e-06f9629abb7a"},"state":"preparing-to-donate","recipientFields":{"donorShardIds":["shard0","shard1"],"existingUUID":{"$uuid":"0fe4b9ee-41d2-4855-8411-32539bc84657"},"originalNamespace":"test.foo"}},"collectionMetadata":"collection version: 2|1||5fec046614ff529dbac7fa05, shard version: 2|0||5fec046614ff529dbac7fa05"}
...
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.457+0000 d20022| 2020-12-30T04:39:02.457+00:00 D1 MIGRATE  5002300 [ReshardingRecipientService-0] "Creating temporary resharding collection","attr":{"originalNss":"test.foo"}
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.460+0000 d20022| 2020-12-30T04:39:02.460+00:00 I  SH_REFR  4619901 [CatalogCache-0] "Refreshed cached collection","attr":{"namespace":"test.foo","newVersion":{"chunkVersion":{"0":{"$timestamp":{"t":2,"i":1}},"1":{"$oid":"5fec046697d08cdb539562b8"}},"forcedRefreshSequenceNum":1,"epochDisambiguatingSequenceNum":7},"oldVersion":{"chunkVersion":"None","forcedRefreshSequenceNum":0,"epochDisambiguatingSequenceNum":0},"durationMillis":2}
[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-30T04:39:02.461+0000 d20022| 2020-12-30T04:39:02.461+00:00 I  STORAGE  20320   [ReshardingRecipientService-0] "createCollection","attr":{"namespace":"test.system.resharding.0fe4b9ee-41d2-4855-8411-32539bc84657","uuidDisposition":"provided","uuid":{"uuid":{"$uuid":"6e22c09a-3051-43d4-861e-06f9629abb7a"}},"options":{"uuid":{"$uuid":"6e22c09a-3051-43d4-861e-06f9629abb7a"}}}

https://logkeeper.mongodb.org/lobster/build/66cd6439e38b9758dcdfff57bba7bf5a/test/5fec045654f248578176ab41#bookmarks=0%2C2548%2C2549%2C2555%2C2556%2C2561%2C2562%2C2567%2C2568%2C2569%2C10865&f~=000~%22createCollection%22.%2A%22test%5C.system&l=1

(These logs are from a patch build where the "Ignoring shard version change" and "Creating recipient state machine" messages have been added.)
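
For what it's worth, the divergence should also be observable in the persisted routing table cache on each shard. A minimal shell sketch, assuming the shards listen on localhost:20020/20022 (the dNNNNN convention) and assuming the persisted entry in config.cache.collections carries the reshardingFields and lastRefreshedCollectionVersion fields; the connection variables below are illustrative, not taken from the test:

// Illustrative only: compare the cached resharding state on both shard primaries.
const tempNs = "test.system.resharding.0fe4b9ee-41d2-4855-8411-32539bc84657";
const shardConns = [new Mongo("localhost:20020"), new Mongo("localhost:20022")];
shardConns.forEach((conn) => {
    const cached = conn.getDB("config").getCollection("cache.collections").findOne({_id: tempNs});
    // Both shards report the same collection version, yet can disagree on the cached
    // reshardingFields.state ("preparing-to-donate" vs "cloning").
    print(conn.host + ": state=" + cached.reshardingFields.state +
          ", lastRefreshedCollectionVersion=" + tojson(cached.lastRefreshedCollectionVersion));
});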



 Comments   
Comment by Githook User [ 19/Mar/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-55307: Complete TODO listed in SERVER-53539
Branch: master
https://github.com/mongodb/mongo/commit/87e14fa4bd8df77de512cdb85b72fa89abca353f

Comment by Jordi Serra Torrens [ 02/Mar/21 ]

max.hirschhorn I've filed SERVER-54874 to address the issue of the ConfigServerCatalogCache refreshing inconsistent config.collections and config.chunks should the metadata-change transaction have committed in between the two reads.

If, in addition to that, we ensure that every modification to 'allowMigrations' and 'reshardingFields' on config.collections also bumps the collection version (by bumping one chunk), then we don't need to change the collection epoch.
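
A rough shell-level sketch of that invariant, assuming the coordinator's metadata change runs as a single transaction against the config server replica set (variable names like configPrimary and tempNs are placeholders; the real change lives in the coordinator's C++ transaction):

// Illustrative sketch: every change to reshardingFields (or allowMigrations) on the
// config.collections entry also bumps the version of at least one chunk, so any
// refresh that observes the new chunk versions must also have been served the new
// fields (given the two reads themselves are consistent, per SERVER-54874).
const session = configPrimary.startSession();
const configDB = session.getDatabase("config");
session.startTransaction();

// 1. Advance the coordinator state on the collection entry.
configDB.collections.updateOne({_id: tempNs},
                               {$set: {"reshardingFields.state": "cloning"}});

// 2. Bump the major version of the collection's newest chunk (the real change would
//    do this for one chunk per recipient shard).
const newest = configDB.chunks.find({ns: tempNs}).sort({lastmod: -1}).limit(1).next();
configDB.chunks.updateOne({_id: newest._id},
                          {$set: {lastmod: Timestamp(newest.lastmod.t + 1, 0)}});

session.commitTransaction();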

Comment by Max Hirschhorn [ 16/Feb/21 ]

I think the reason why this hasn't historically been an issue is that the collection metadata (collection UUID, shard key pattern, collection default collation) have never been allowed to change after they've been initially set unless the epoch has also changed (where the collection was dropped, re-created, and sharded again).

If we wanted to continue to bump the chunk version instead of the epoch, then I could imagine that, for resharding, we could include the collection version in the donor fields (for the existing sharded collection) and in the recipient fields (for the temporary resharding collection) and change createConfigDiffQuery() to also apply {$lte: <collection version>}. This wouldn't help the allowMigrations setting in the config.collections document, which is also meant to be used by the "Sharded collections support all DDL operations" project. I would want to get some input from the Sharding EMEA folks here. CC Kaloian Manassiev, Tommaso Tocci

kaloian.manassiev, jordi.serra-torrens, should changes to the allowMigrations setting and to the reshardingFields then entail bumping the collection epoch each time? Triggering a full refresh is more expensive but would at least guarantee that the config.collections metadata is associated with the correct chunk version.

I'd like to do the work to make resharding be correct (but perhaps not efficient) if we don't plan on scheduling work now to do something more efficient.

Comment by Blake Oler [ 16/Feb/21 ]

jordi.serra-torrens reached out to me with an update that the ticket we were depending on to fix this issue (SERVER-53105) will no longer fix it. The approach taken for SERVER-53105 didn't end up changing the catalog cache to use snapshot reads, since that approach was giving the EMEA team trouble. Flagging this for max.hirschhorn so we can bring it up for discussion again.

Comment by Max Hirschhorn [ 31/Dec/20 ]

I ran another patch build with some additional log messages (namely this one) and found that the "refreshedCollectionVersion" isn't consistent with the "refreshedReshardingFields". The config server is responding back with newer chunks than the associated collection metadata.

[js_test:resharding_replicate_updates_as_insert_delete] 2020-12-31T00:34:53.057+0000 d20020| 2020-12-31T00:34:53.056+00:00 I  SH_REFR  24108   [ShardServerCatalogCacheLoader::getChunksSince] "Cache loader remotely refreshed for collection","attr":{"namespace":"test.system.resharding.fec48268-09fc-4ce7-80f9-b33926b6c38e","oldCollectionVersion":{"0":{"$timestamp":{"t":1,"i":1}},"1":{"$oid":"5fed1cac26c57d68027aac34"}},"refreshedCollectionVersion":{"0":{"$timestamp":{"t":2,"i":1}},"1":{"$oid":"5fed1cac26c57d68027aac34"}},"refreshedReshardingFields":{"uuid":{"$uuid":"8f26efcc-4f27-4d8d-963e-58565c104d4d"},"state":"preparing-to-donate","recipientFields":{"donorShardIds":["shard0","shard1"],"existingUUID":{"$uuid":"fec48268-09fc-4ce7-80f9-b33926b6c38e"},"originalNamespace":"test.foo"}}}

The issue is pretty clear to me from looking at getChangedChunks() in config_server_catalog_cache_loader.cpp. The shard issues two separate reads with "majority" read concern level, which makes the following scenario possible (sketched after the list):

  1. The test client attempts to read from the temporary resharding collection. This triggers a shard version refresh on the shard (and router).
  2. The shard reads the config.collections entry for the temporary resharding collection. It sees reshardingFields.state == "preparing-to-donate".
  3. The resharding coordinator performs a local replica set transaction which updates reshardingFields.state to "cloning" in the config.collections document and bumps the chunk version of one chunk document per recipient shard in config.chunks.
  4. The shard reads the chunks since its last refresh and observes the effects of the now-committed transaction.
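
Expressed at the shell level, steps 2 and 4 are two independent majority reads rather than a single snapshot, which is what lets the coordinator's transaction slip in between them (collection and field names follow the 4.9-era config schema; the connection and version variables are placeholders):

// Illustrative: the loader's remote refresh boils down to two separate reads.
const configDB = configConn.getDB("config");  // configConn: assumed CSRS connection

// Read 1 (step 2): the collection entry, which carries the reshardingFields.
const collEntry = configDB.collections.findOne({_id: tempNs});

// <-- The coordinator's transaction can commit here (step 3), moving
//     reshardingFields.state to "cloning" and bumping a chunk version.

// Read 2 (step 4): the chunks changed since the last cached collection version.
const changedChunks = configDB.chunks.find({ns: tempNs, lastmod: {$gte: lastKnownVersion}})
                                     .sort({lastmod: 1})
                                     .toArray();

// Outcome: changedChunks imply collection version 2|1 while collEntry still carries
// the reshardingFields that matched 2|0 ("preparing-to-donate").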

I think the reason why this hasn't historically been an issue is that the collection metadata (collection UUID, shard key pattern, collection default collation) have never been allowed to change after they've been initially set unless the epoch has also changed (where the collection was dropped, re-created, and sharded again).

If we wanted to continue to bump the chunk version instead of the epoch, then I could imagine that, for resharding, we could include the collection version in the donor fields (for the existing sharded collection) and in the recipient fields (for the temporary resharding collection) and change createConfigDiffQuery() to also apply {$lte: <collection version>}. This wouldn't help the allowMigrations setting in the config.collections document, which is also meant to be used by the "Sharded collections support all DDL operations" project. I would want to get some input from the Sharding EMEA folks here. CC kaloian.manassiev, tommaso.tocci
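
The createConfigDiffQuery() idea above, sketched as the resulting chunk diff query; maxCollectionVersion stands for the collection version recorded in the donor/recipient fields, and the variable names are illustrative:

// Illustrative: bound the chunk diff from above by the collection version that the
// resharding fields were written together with, so a refresh can never return chunks
// newer than the config.collections entry it is paired with.
const diffQuery = {
    ns: tempNs,
    lastmod: {
        $gte: lastKnownCollectionVersion,  // existing lower bound from createConfigDiffQuery()
        $lte: maxCollectionVersion         // proposed upper bound from donor/recipient fields
    }
};
const changedChunks = configDB.chunks.find(diffQuery).sort({lastmod: 1}).toArray();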


Steps to reproduce

python buildscripts/resmoke.py run --suite=sharding jstests/sharding/resharding_replicate_updates_as_insert_delete.js

diff --git a/jstests/sharding/resharding_replicate_updates_as_insert_delete.js b/jstests/sharding/resharding_replicate_updates_as_insert_delete.js
index 0afa0409b9..e0f69101eb 100644
--- a/jstests/sharding/resharding_replicate_updates_as_insert_delete.js
+++ b/jstests/sharding/resharding_replicate_updates_as_insert_delete.js
@@ -43,23 +43,11 @@ reshardingTest.withReshardingInBackground(  //
         ],
     },
     (tempNs) => {
-        // Wait for cloning to have finished on both recipient shards to know that the donor shards
-        // have begun including the "destinedRecipient" field in their oplog entries. It would
-        // technically be sufficient to only wait for cloning to have *started*, but querying the
-        // temporary resharding collection through mongos may cause the RecipientStateMachine to
-        // never be constructed on recipientShardNames[0].
-        //
-        // TODO SERVER-53539: Replace the assert.soon() with the following code.
-        //
-        // const tempColl = mongos.getCollection(tempNs);
-        // assert.soon(() => tempColl.findOne(docToUpdate) !== null);
-        assert.soon(() => {
-            const coordinatorDoc = mongos.getCollection("config.reshardingOperations").findOne({
-                nss: testColl.getFullName()
-            });
+        const tempColl = mongos.getCollection(tempNs);
 
-            return coordinatorDoc !== null && coordinatorDoc.state === "applying";
-        });
+        // Wait for cloning to have at least started to know that the donor shards have begun
+        // including the "destinedRecipient" field in their oplog entries.
+        assert.soon(() => tempColl.findOne(docToUpdate) !== null);
 
         // TODO SERVER-52683: Change assertion to say the update succeeds. Also capture the
         // operationTime associated with the write and assert the generated oplog entry is a
diff --git a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
index 828610bedc..0e7d20adda 100644
--- a/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
+++ b/src/mongo/db/s/resharding/resharding_coordinator_service.cpp
@@ -829,6 +829,7 @@ ReshardingCoordinatorService::ReshardingCoordinator::_awaitAllDonorsReadyToDonat
         .then([this](ReshardingCoordinatorDocument coordinatorDocChangedOnDisk) {
             auto highestMinFetchTimestamp =
                 getHighestMinFetchTimestamp(coordinatorDocChangedOnDisk.getDonorShards());
+            sleepsecs(1);
             _updateCoordinatorDocStateAndCatalogEntries(CoordinatorStateEnum::kCloning,
                                                         coordinatorDocChangedOnDisk,
                                                         highestMinFetchTimestamp);
diff --git a/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp b/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp
index 55f928fd7e..d172f46b55 100644
--- a/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp
+++ b/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp
@@ -27,12 +27,16 @@
  *    it in the license file.
  */
 
+#define MONGO_LOGV2_DEFAULT_COMPONENT ::mongo::logv2::LogComponent::kSharding
+
 #include "mongo/platform/basic.h"
 
 #include "mongo/db/s/resharding/resharding_donor_recipient_common.h"
 
 #include <fmt/format.h>
 
+#include "mongo/logv2/log.h"
+
 namespace mongo {
 namespace resharding {
 
@@ -118,9 +122,18 @@ void processReshardingFieldsForRecipientCollection(OperationContext* opCtx,
     // RecipientStateMachine until all donor shards are known to be prepared for the recipient to
     // start cloning.
     if (reshardingFields.getState() != CoordinatorStateEnum::kCloning) {
+        LOGV2(5353900,
+              "Ignoring shard version change",
+              "reshardingFields"_attr = reshardingFields,
+              "collectionMetadata"_attr = metadata.toStringBasic());
         return;
     }
 
+    LOGV2(5353901,
+          "Creating recipient state machine",
+          "reshardingFields"_attr = reshardingFields,
+          "collectionMetadata"_attr = metadata.toStringBasic());
+
     auto recipientDoc =
         constructRecipientDocumentFromReshardingFields(opCtx, metadata, reshardingFields);
     createReshardingStateMachine<ReshardingRecipientService,
diff --git a/src/mongo/s/config_server_catalog_cache_loader.cpp b/src/mongo/s/config_server_catalog_cache_loader.cpp
index f0ddca35f3..552f4a3aa3 100644
--- a/src/mongo/s/config_server_catalog_cache_loader.cpp
+++ b/src/mongo/s/config_server_catalog_cache_loader.cpp
@@ -94,6 +94,10 @@ CollectionAndChangedChunks getChangedChunks(OperationContext* opCtx,
     // Diff tracker should *always* find at least one chunk if collection exists
     const auto diffQuery = createConfigDiffQuery(nss, startingCollectionVersion);
 
+    if (nss.isTemporaryReshardingCollection()) {
+        sleepsecs(5);
+    }
+
     // Query the chunks which have changed
     repl::OpTime opTime;
     const std::vector<ChunkType> changedChunks = uassertStatusOK(
