Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.0-rc13
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Cluster Scalability
Backwards Compatibility:
Fully Compatible
Backport Requested:

v8.0
Sprint:
Cluster Scalability 2024-07-22
Linked BF Score:
0
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

$_analyzeShardKeyReadWriteDistribution has a step that gets the collection's default collator. That step currently throws NamespaceNotFound if the collection doesn't exist on the shard. This error code can be misleading for a collection on a sharded cluster in the following case:

1. The analyzeShardKey command finished generating the split points on shard0.
2. shard0 executed a cluster aggregate command with $_analyzeShardKeyReadWriteDistribution against all shards that owned chunks for the collection at that point, say shard0 and shard1, where shard1 is a config shard trying to transition to a dedicated config server.
   - The aggregate command was executed with batchSize: 0 so that the router (shard0) and the shards agree on the shard version first.
   - AutoGetCollectionForReadCommandMaybeLockFree performed a shard version check and didn't see anything stale, so the cursor was established.
3. A chunk migration moved the remaining data out of shard1. The transitionToDedicatedConfigServer command then removed shard1 and dropped all the user databases on it.
4. shard0 executed getMore commands against all the shards. On shard1, AutoGetCollectionForReadCommand didn't perform a shard version check because the collection didn't exist, and didn't throw StaleConfig because, by design, the getMore command doesn't carry a shard version.
5. The getMore command on shard1 then threw NamespaceNotFound.

The same applies when a moveCollection runs in step 3, since moveCollection, like reshardCollection, also drops the original collection.
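
The sequence above can be illustrated with a toy model. This is only a sketch of the reasoning, not server code: ToyShard, CommandError, and their methods are hypothetical names invented for this example. The only facts taken from the server are the numeric error codes for NamespaceNotFound (26) and StaleConfig (13388). The model shows that the shard version is checked only when the cursor is established, so a collection dropped afterwards surfaces as NamespaceNotFound on the next getMore.

```python
# Toy model of the race described above. All class and method names here are
# hypothetical; they do not correspond to real server APIs.
NAMESPACE_NOT_FOUND = 26  # server error code for NamespaceNotFound
STALE_CONFIG = 13388      # server error code for StaleConfig


class CommandError(Exception):
    """Stand-in for a command failure that carries a server error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


class ToyShard:
    def __init__(self, collections, shard_version):
        self.collections = set(collections)
        self.shard_version = shard_version

    def aggregate(self, ns, request_version):
        # The initial aggregate (batchSize: 0) carries a shard version,
        # so a stale router would be detected here.
        if request_version != self.shard_version:
            raise CommandError(STALE_CONFIG, "stale shard version")
        return {"ns": ns}  # the established cursor

    def get_more(self, cursor):
        # getMore carries no shard version, so there is nothing to compare.
        # A missing collection therefore surfaces as NamespaceNotFound.
        if cursor["ns"] not in self.collections:
            raise CommandError(NAMESPACE_NOT_FOUND, "collection not found on shard")
        return []


shard1 = ToyShard({"db.coll"}, shard_version=1)
cursor = shard1.aggregate("db.coll", request_version=1)  # step 2: cursor established
shard1.collections.clear()  # step 3: data moved away, databases dropped on shard1

try:
    shard1.get_more(cursor)  # steps 4 and 5
    observed_code = None
except CommandError as exc:
    observed_code = exc.code
```

In this model, observed_code ends up as NamespaceNotFound even though the collection still exists elsewhere in the cluster, which is the confusing outcome the ticket describes.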

Given this, $_analyzeShardKeyReadWriteDistribution should throw QueryPlanKilled instead of NamespaceNotFound in this case. That way, a user would simply retry the analyzeShardKey command instead of being confused about why it reports that the namespace was not found when the collection exists in the cluster.
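
A minimal sketch of the proposed behavior, under stated assumptions: the function names (get_default_collator, get_default_collator_for_analysis, run_with_retry) and the CommandError class are hypothetical and exist only for illustration; the actual change lives in the server's C++ code and may look quite different. The error codes 26 (NamespaceNotFound) and 175 (QueryPlanKilled) are the server's codes for those errors. The first part shows the error translation, the second shows how a caller might retry on QueryPlanKilled.

```python
NAMESPACE_NOT_FOUND = 26  # server error code for NamespaceNotFound
QUERY_PLAN_KILLED = 175   # server error code for QueryPlanKilled


class CommandError(Exception):
    """Stand-in for a command failure that carries a server error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def get_default_collator(collections, ns):
    # Hypothetical lookup: fails with NamespaceNotFound if the collection
    # does not exist on this shard.
    if ns not in collections:
        raise CommandError(NAMESPACE_NOT_FOUND, f"{ns} not found")
    return collections[ns]


def get_default_collator_for_analysis(collections, ns):
    # Proposed behavior: report the missing collection as QueryPlanKilled,
    # since the cursor was established while the collection still existed.
    try:
        return get_default_collator(collections, ns)
    except CommandError as exc:
        if exc.code == NAMESPACE_NOT_FOUND:
            raise CommandError(
                QUERY_PLAN_KILLED,
                f"{ns} was dropped or moved while the analysis was running",
            ) from exc
        raise


def run_with_retry(run_command, max_attempts=3):
    # Caller-side sketch: retry only when the failure is QueryPlanKilled.
    for attempt in range(1, max_attempts + 1):
        try:
            return run_command()
        except CommandError as exc:
            if exc.code != QUERY_PLAN_KILLED or attempt == max_attempts:
                raise
```

With a real driver, run_command would wrap the analyzeShardKey invocation and the except clause would inspect the driver's own error type instead of CommandError.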

Assignee: Cheahuychou Mao
Reporter: Cheahuychou Mao
Participants: Cheahuychou Mao, Githook User
Votes: 0
Watchers: 2

Created: Jul 08 2024 07:39:14 PM UTC
Updated: Jul 10 2024 05:45:36 AM UTC
Resolved: Jul 10 2024 05:34:29 AM UTC
