- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Cluster Scalability
- Fully Compatible
- v8.0
- Cluster Scalability 2024-07-22
- 0
$_analyzeShardKeyReadWriteDistribution has a step that gets the collection's default collator, and that step currently throws NamespaceNotFound if the collection doesn't exist on the shard. On a sharded cluster this error code can be misleading, as in the following sequence:
1. The analyzeShardKey command finished generating the split points on shard0.
2. shard0 executed a cluster aggregate command with $_analyzeShardKeyReadWriteDistribution against all shards that owned chunks for the collection at that point, say shard0 and shard1, where shard1 was a config shard in the middle of transitioning to a dedicated config server.
3. A chunk migration moved the remaining data out of shard1. The transitionToDedicatedConfigServer command removed shard1 and dropped all the user databases on it.
4. shard0 executed getMore commands against all the shards. On shard1, AutoGetCollectionForReadCommand didn't perform a shard version check because the collection didn't exist, and it didn't throw StaleConfig because, by design, the getMore command doesn't carry a shard version.
5. The getMore command on shard1 then threw NamespaceNotFound.
The same applies when step 3 is a moveCollection, since reshardCollection also drops the original collection.
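The race in the steps above can be illustrated with a toy model. This is only a sketch: `ToyShard`, `placement_version`, `get_default_collator`, and `drop_all_user_databases` are hypothetical names, and the exception classes are local stand-ins for the server error codes, not the server's implementation. It assumes a simplified rule in which a request is checked for staleness only when it carries a shard version.

```python
class NamespaceNotFound(Exception):
    """Local stand-in for the server's NamespaceNotFound error."""


class StaleConfig(Exception):
    """Local stand-in for the server's StaleConfig error."""


class ToyShard:
    """Hypothetical, heavily simplified model of one shard."""

    def __init__(self, collections):
        # namespace -> collection metadata, e.g. {"collation": "simple"}
        self.collections = dict(collections)
        self.placement_version = 1

    def drop_all_user_databases(self):
        # Models step 3: the data is moved away and the databases are dropped.
        self.collections.clear()
        self.placement_version += 1

    def get_default_collator(self, ns, shard_version=None):
        # A versioned request (like the initial aggregate) can be rejected
        # as stale, which lets the router refresh and retarget.
        if shard_version is not None and shard_version != self.placement_version:
            raise StaleConfig(f"stale shard version for {ns}")
        # An unversioned request (like getMore) skips that check and only
        # discovers that the collection is gone.
        if ns not in self.collections:
            raise NamespaceNotFound(f"collection {ns} not found on this shard")
        return self.collections[ns]["collation"]


if __name__ == "__main__":
    shard1 = ToyShard({"test.coll": {"collation": "simple"}})
    shard1.get_default_collator("test.coll", shard_version=1)  # step 2 succeeds
    shard1.drop_all_user_databases()                           # step 3
    try:
        shard1.get_default_collator("test.coll")               # steps 4 and 5
    except NamespaceNotFound as exc:
        print("getMore failed with NamespaceNotFound:", exc)
```

In this model the unversioned call is the only path that surfaces NamespaceNotFound, which matches the getMore behavior described in steps 4 and 5.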
Given this, $_analyzeShardKeyReadWriteDistribution should throw QueryPlanKilled instead of NamespaceNotFound in this case. That way, a user would simply retry the analyzeShardKey command instead of being confused about why it reports that the namespace was not found when the collection still exists in the cluster.
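A minimal sketch of the proposed behavior and of the retry it enables, written as a standalone model rather than server code. The function names `get_default_collator_for_analysis` and `run_with_retry`, the `lookup_collator` and `run_analyze_shard_key` callables, and the retry parameters are all hypothetical; the exception classes are local stand-ins for the server error codes.

```python
import time


class NamespaceNotFound(Exception):
    """Local stand-in for the server's NamespaceNotFound error."""


class QueryPlanKilled(Exception):
    """Local stand-in for the server's QueryPlanKilled error."""


def get_default_collator_for_analysis(lookup_collator, ns):
    """Look up the default collator, remapping a missing collection.

    `lookup_collator` is a hypothetical callable that raises
    NamespaceNotFound when the collection is absent on this shard.
    """
    try:
        return lookup_collator(ns)
    except NamespaceNotFound as exc:
        # The collection may still exist elsewhere in the cluster, so
        # report a retryable "plan killed" condition instead.
        raise QueryPlanKilled(
            f"collection {ns} was dropped or moved during shard key analysis"
        ) from exc


def run_with_retry(run_analyze_shard_key, max_attempts=3, delay_secs=0.0):
    """Retry a hypothetical analyzeShardKey invocation on QueryPlanKilled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_analyze_shard_key()
        except QueryPlanKilled:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(delay_secs)
```

With this shape, a caller that sees QueryPlanKilled can rerun the command against the collection's new location, whereas NamespaceNotFound would suggest that the collection no longer exists at all.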