  1. Core Server
  2. SERVER-92202

$_analyzeShardKeyReadWriteDistribution stage should throw QueryPlanKilled instead of NamespaceNotFound

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 8.1.0-rc0, 8.0.0-rc13
    • Affects Version/s: None
    • Component/s: None
    • None
    • Cluster Scalability
    • Fully Compatible
    • v8.0
    • Cluster Scalability 2024-07-22
    • 0

      $_analyzeShardKeyReadWriteDistribution has a step that gets the collection's default collator, which currently throws NamespaceNotFound if the collection doesn't exist on the shard. This error code can be misleading for a collection on a sharded cluster in the following case:

      1. The analyzeShardKey command finished generating the split points on shard0.
      2. shard0 executed a cluster aggregate command with $_analyzeShardKeyReadWriteDistribution against all shards that owned chunks for the collection at that point, say shard0 and shard1, where shard1 is a config shard trying to transition to a dedicated config server.
        • The aggregate command was executed with batchSize: 0 to make the router (shard0) and the shards agree on the shard version first.
        • The AutoGetCollectionForReadCommandMaybeLockFree performed a shard version check and didn't see anything stale. The cursor was established.
      3. A chunk migration moved the remaining data out of shard1. The transitionToDedicatedConfigServer command removed shard1 and dropped all the user databases on it.
      4. shard0 executed getMore commands against all the shards. On shard1, the AutoGetCollectionForReadCommand didn't perform a shard version check because the collection didn't exist, and didn't throw StaleConfig because, by design, the getMore command doesn't carry a shard version.
      5. The getMore command on shard1 then threw NamespaceNotFound.
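
The steps above can be modeled with a small toy simulation. This is only an illustrative sketch: ToyShard, its methods, and the namespace "test.coll" are hypothetical stand-ins and do not correspond to classes or code in the server. It shows why the version check protects the initial aggregate but not the later getMore.

```python
# Toy model of the race described above. None of these names exist in the
# MongoDB codebase; they are assumptions made for illustration only.

class StaleConfig(Exception):
    pass


class NamespaceNotFound(Exception):
    pass


class ToyShard:
    def __init__(self, collections, shard_version):
        self.collections = set(collections)
        self.shard_version = shard_version

    def aggregate(self, ns, sent_version):
        # The initial command (batchSize: 0) carries a shard version,
        # so a mismatch between router and shard is detected here.
        if sent_version != self.shard_version:
            raise StaleConfig(ns)
        return {"ns": ns}  # cursor established, no documents returned yet

    def get_more(self, cursor):
        # getMore carries no shard version, so there is nothing to compare.
        # The collator lookup is the first place that notices the drop.
        if cursor["ns"] not in self.collections:
            raise NamespaceNotFound(cursor["ns"])
        return []


shard1 = ToyShard({"test.coll"}, shard_version=5)
cursor = shard1.aggregate("test.coll", sent_version=5)  # step 2
shard1.collections.discard("test.coll")                 # step 3: data moved, db dropped
try:
    shard1.get_more(cursor)                             # steps 4 and 5
    outcome = "ok"
except NamespaceNotFound:
    outcome = "NamespaceNotFound"
print(outcome)
```

In this model the collection still exists elsewhere in the cluster, yet the caller sees NamespaceNotFound, which is the misleading part.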

      The same applies to the case where there is a moveCollection in step 3 since reshardCollection also drops the original collection.

      Given this, $_analyzeShardKeyReadWriteDistribution should just throw QueryPlanKilled instead of NamespaceNotFound. That way, a user would just retry the analyzeShardKey command instead of being confused about why the error says the namespace was not found when the collection exists in the cluster.
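
A minimal sketch of the proposed behavior and of the retry it enables, under stated assumptions: CommandError, get_default_collator, and run_with_retry are hypothetical names, not the server's or any driver's API. The numeric values are the server's error codes for QueryPlanKilled (175) and NamespaceNotFound (26).

```python
QUERY_PLAN_KILLED = 175   # server error code for QueryPlanKilled
NAMESPACE_NOT_FOUND = 26  # server error code for NamespaceNotFound


class CommandError(Exception):
    """Hypothetical stand-in for a driver error that carries a server code."""

    def __init__(self, code, msg=""):
        super().__init__(msg)
        self.code = code


def get_default_collator(collections, ns):
    """Hypothetical collator lookup that reports a drop as QueryPlanKilled."""
    if ns not in collections:
        raise CommandError(QUERY_PLAN_KILLED, f"{ns} was dropped on this shard")
    return collections[ns]


def run_with_retry(run_command, max_attempts=3):
    """Retry only on QueryPlanKilled; re-raise every other error at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_command()
        except CommandError as err:
            if err.code != QUERY_PLAN_KILLED or attempt == max_attempts:
                raise


attempts = []


def fake_analyze_shard_key():
    # Assumed scenario: the first attempt races with the drop, the second
    # attempt targets only the shards that still own chunks.
    attempts.append(1)
    if len(attempts) == 1:
        get_default_collator({}, "test.coll")
    return {"ok": 1}


result = run_with_retry(fake_analyze_shard_key)
print(result, len(attempts))
```

Treating QueryPlanKilled as retryable and NamespaceNotFound as final keeps a genuinely missing collection from being retried pointlessly.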

            Assignee:
            cheahuychou.mao@mongodb.com Cheahuychou Mao
            Reporter:
            cheahuychou.mao@mongodb.com Cheahuychou Mao
            Votes:
            0
            Watchers:
            2