[SERVER-61945] Resharding collection cloning may fail with NamespaceNotSharded when "nearest" read preference chooses secondary Created: 07/Dec/21  Updated: 29/Oct/23  Resolved: 09/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.3.0, 5.1.2, 5.0.6, 5.2.0-rc1

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-60860 ReshardingCollectionCloner uses prima... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.2, v5.1, v5.0
Sprint: Sharding 2021-12-13
Participants:
Linked BF Score: 40
Story Points: 2

 Description   

Because of a mirrored read or an earlier failed resharding operation, a secondary may already be aware of the temporary resharding namespace and believe that the collection is unsharded. A primary is guaranteed to have refreshed its CatalogCache after the config.chunks entries have been written on the config server primary, but there is no equivalent guarantee for secondaries. As a result, when the "nearest" read preference chooses a secondary, the call to CatalogCache::getShardedCollectionRoutingInfo() in the $_internalReshardingOwnershipMatch stage can throw a NamespaceNotSharded exception.

We can instead use CatalogCache::getShardedCollectionRoutingInfoWithRefresh() to ensure that the secondary refreshes after the config.chunks entries have been written on the config server primary, and therefore knows that the temporary resharding namespace is sharded.
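The race and the fix can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not MongoDB's code: ToyCatalogCache, ConfigServer, RoutingInfo, NamespaceNotSharded and demoStaleSecondary are hypothetical stand-ins. The real CatalogCache refreshes asynchronously and returns much richer routing information, which this model ignores. It only shows why trusting a cached "unsharded" entry fails, while forcing a refresh after config.chunks has been written succeeds.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the NamespaceNotSharded error.
struct NamespaceNotSharded : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical routing info; only a version number is modeled.
struct RoutingInfo {
    int version;
};

// Hypothetical authoritative metadata on the config server primary
// (plays the role of the config.chunks entries).
struct ConfigServer {
    std::map<std::string, RoutingInfo> shardedCollections;
};

// Toy model of a node's catalog cache; NOT the real mongo::CatalogCache API.
class ToyCatalogCache {
public:
    explicit ToyCatalogCache(const ConfigServer& config) : config_(config) {}

    // Caches whatever the config server currently says, including "unsharded".
    void refresh(const std::string& nss) {
        auto it = config_.shardedCollections.find(nss);
        if (it == config_.shardedCollections.end()) {
            cache_[nss] = std::nullopt;  // remembered as unsharded
        } else {
            cache_[nss] = it->second;
        }
    }

    // Analogous to getShardedCollectionRoutingInfo(): trusts any cached entry.
    RoutingInfo getShardedRoutingInfo(const std::string& nss) {
        if (cache_.find(nss) == cache_.end()) {
            refresh(nss);  // only the first lookup loads the entry
        }
        const auto& entry = cache_.at(nss);
        if (!entry) {
            throw NamespaceNotSharded("Expected collection " + nss + " to be sharded");
        }
        return *entry;
    }

    // Analogous to getShardedCollectionRoutingInfoWithRefresh(): always refreshes
    // first, so metadata written before this call is observed.
    RoutingInfo getShardedRoutingInfoWithRefresh(const std::string& nss) {
        refresh(nss);
        return getShardedRoutingInfo(nss);
    }

private:
    const ConfigServer& config_;
    std::map<std::string, std::optional<RoutingInfo>> cache_;
};

// Walks through the race: the secondary caches "unsharded" for the temporary
// namespace before the config.chunks entries exist.
inline void demoStaleSecondary() {
    ConfigServer config;
    ToyCatalogCache secondary(config);
    const std::string tempNss = "test.system.resharding.example";  // hypothetical name

    secondary.refresh(tempNss);                         // e.g. a mirrored read
    config.shardedCollections[tempNss] = RoutingInfo{1};  // resharding writes metadata

    bool threw = false;
    try {
        secondary.getShardedRoutingInfo(tempNss);  // stale entry -> error
    } catch (const NamespaceNotSharded&) {
        threw = true;
    }
    assert(threw);
    assert(secondary.getShardedRoutingInfoWithRefresh(tempNss).version == 1);
}
```

In the model, the plain lookup keeps returning the stale "unsharded" answer indefinitely, while the refreshing variant picks up the metadata because it was written before the call. That ordering is the property the description relies on for secondaries.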

[js_test:resharding_replicate_updates_as_insert_delete] d20770| 2021-12-06T22:22:48.821+00:00 I  SH_REFR  4619902 [CatalogCache-1] "Collection has found to be unsharded after refresh","attr":{"namespace":"test.system.resharding.bce940ec-7251-4fc2-9dbd-b45d5614a2aa","durationMillis":13}
[js_test:resharding_replicate_updates_as_insert_delete] d20770| 2021-12-06T22:22:48.821+00:00 I  SHARDING 21917   [RecoverRefreshThread] "Marking collection as unsharded","attr":{"namespace":"test.system.resharding.bce940ec-7251-4fc2-9dbd-b45d5614a2aa"}
[js_test:resharding_replicate_updates_as_insert_delete] d20771| 2021-12-06T22:22:48.823+00:00 I  SH_REFR  4619902 [CatalogCache-0] "Collection has found to be unsharded after refresh","attr":{"namespace":"test.system.resharding.bce940ec-7251-4fc2-9dbd-b45d5614a2aa","durationMillis":17}
[js_test:resharding_replicate_updates_as_insert_delete] d20771| 2021-12-06T22:22:48.823+00:00 I  SHARDING 21917   [RecoverRefreshThread] "Marking collection as unsharded","attr":{"namespace":"test.system.resharding.bce940ec-7251-4fc2-9dbd-b45d5614a2aa"}
...
[js_test:resharding_replicate_updates_as_insert_delete] d20772| 2021-12-06T22:22:49.630+00:00 E  RESHARD  5352400 [ReshardingRecipientService-0] "Operation-fatal error for resharding while cloning sharded collection","attr":{"sourceNamespace":"test.foo","outputNamespace":"test.system.resharding.bce940ec-7251-4fc2-9dbd-b45d5614a2aa","readTimestamp":{"$timestamp":{"t":1638829369,"i":4}},"error":"NamespaceNotSharded: Error on remote shard EC2AMAZ-6BUUU1A:20771 :: caused by :: Executor error during getMore :: caused by :: Expected collection test.system.resharding.bce940ec-7251-4fc2-9dbd-b45d5614a2aa to be sharded"}

[js_test:setfcv_reshard_collection] d20277| 2021-12-02T12:55:55.323+00:00 I  COMMAND  20332   [ReplWriterWorker-1] "CMD: drop","attr":{"namespace":"config.cache.chunks.reshardingDb.system.resharding.49b12d21-45eb-4d34-9e8b-22465a75d490"}
[js_test:setfcv_reshard_collection] d20277| 2021-12-02T12:55:55.325+00:00 I  SH_REFR  4619902 [CatalogCache-0] "Collection has found to be unsharded after refresh","attr":{"namespace":"reshardingDb.system.resharding.49b12d21-45eb-4d34-9e8b-22465a75d490","durationMillis":99}
...
[js_test:setfcv_reshard_collection] d20276| 2021-12-02T12:55:55.834+00:00 E  RESHARD  5352400 [ReshardingRecipientService-2] "Operation-fatal error for resharding while cloning sharded collection","attr":{"sourceNamespace":"reshardingDb.testColl","outputNamespace":"reshardingDb.system.resharding.49b12d21-45eb-4d34-9e8b-22465a75d490","readTimestamp":{"$timestamp":{"t":1638449755,"i":81}},"error":"NamespaceNotSharded: Error on remote shard ip-10-122-57-106.ec2.internal:20277 :: caused by :: Executor error during getMore :: caused by :: Expected collection reshardingDb.system.resharding.49b12d21-45eb-4d34-9e8b-22465a75d490 to be sharded"}



 Comments   
Comment by Githook User [ 09/Dec/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61945 Refresh temp resharding ns during collection cloning.

(cherry picked from commit 19e8f275378c4ba4a2941ae2eb10249c915ed0f1)
Branch: v5.0
https://github.com/mongodb/mongo/commit/b35ea30e385a3489c996068d52d135db77c1a36f

Comment by Githook User [ 09/Dec/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61945 Refresh temp resharding ns during collection cloning.

(cherry picked from commit 19e8f275378c4ba4a2941ae2eb10249c915ed0f1)
Branch: v5.2
https://github.com/mongodb/mongo/commit/eaf32be9b4bb909b0fc38a9326d7ffb61593f236

Comment by Githook User [ 09/Dec/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61945 Refresh temp resharding ns during collection cloning.

(cherry picked from commit 19e8f275378c4ba4a2941ae2eb10249c915ed0f1)
Branch: v5.1
https://github.com/mongodb/mongo/commit/99f8ffd22e9ac805c0d258024602e9154fb3b897

Comment by Githook User [ 09/Dec/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61945 Refresh temp resharding ns during collection cloning.
Branch: master
https://github.com/mongodb/mongo/commit/19e8f275378c4ba4a2941ae2eb10249c915ed0f1

Generated at Thu Feb 08 05:53:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.