Core Server / SERVER-75806

analyzeShardKey command can fail with readConcern afterClusterTime error


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version: 7.0.0-rc0
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding NYC 2023-04-17

    Description

      The step for calculating the read and write distribution metrics in the analyzeShardKey command works as follows:

      1. The client runs the analyzeShardKey command against a mongos. That mongos forwards the command to a random shard with readPreference "secondaryPreferred".
      2. The shardsvr node running the analyzeShardKey command generates the split points for the shard key being analyzed, and then persists them in the config.analyzeShardKeySplitPoints collection. If the node is not a primary, it runs the insert command against the primary node using ScopedClientConnection (here).
      3. The shardsvr node then sends an aggregate command with the $_analyzeShardKeyReadWriteDistribution stage to all shards that own chunks for the collection. The stage has the following spec, where 'splitPointsAfterClusterTime' is set to the operationTime in the response to the insert command in step 2, and 'splitPointsShardId' is set to its own shard id.

        {
             key: <object>,
             splitPointsFilter: <object>,
             splitPointsAfterClusterTime: <timestamp>,
             splitPointsShardId: <shardId>
        }
        

      4. Each shard running the $_analyzeShardKeyReadWriteDistribution stage then runs an aggregate command against the shard 'splitPointsShardId' to fetch all the split point documents. The aggregate command has a $match stage with the filter set to 'splitPointsFilter' and a readConcern with afterClusterTime set to 'splitPointsAfterClusterTime'. Using the split points, each shard then builds a synthetic routing table and performs the metrics calculation (a rough driver-level sketch of steps 2 and 4 appears after this list).
      5. The shardsvr node running the analyzeShardKey command then combines the metrics returned by all the shards.
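
      As a rough, non-authoritative sketch of what steps 2 and 4 amount to from a driver's point of view, the snippet below issues an insert into config.analyzeShardKeySplitPoints, takes the operationTime from the reply, and then runs an aggregate on that collection whose readConcern carries that operationTime as afterClusterTime. The connection string, the placeholder split point documents and filter, and the "local" read concern level are assumptions for illustration; the server's internal requests are not literally built this way.

        # Illustrative sketch only (assumed names and values), not the server's
        # internal code path.
        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")  # assumed cluster URI
        config = client["config"]

        # Step 2 (simplified): persist the generated split points and capture
        # the operationTime from the insert reply.
        split_point_docs = [{"splitPoint": {"x": 0}}]  # placeholder split points
        insert_reply = config.command({
            "insert": "analyzeShardKeySplitPoints",
            "documents": split_point_docs,
        })
        split_points_after_cluster_time = insert_reply["operationTime"]

        # Step 4 (simplified): fetch the split points back with a causally
        # consistent read that must not see state older than the insert above.
        split_points_filter = {}  # placeholder for 'splitPointsFilter'
        config.command({
            "aggregate": "analyzeShardKeySplitPoints",
            "pipeline": [{"$match": split_points_filter}],
            "cursor": {},
            "readConcern": {
                "level": "local",
                "afterClusterTime": split_points_after_cluster_time,
            },
        })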

      It turns out the use of ScopedClientConnection can result in a causal consistency issue when the analyzeShardKey command targets a secondary node, since ScopedClientConnection does not internally have a VectorClockMetadataHook (e.g. like here) that reads the clusterTime of every reply and advances the VectorClock. That is, if the inserts have not replicated to the node running the analyzeShardKey command before the $_analyzeShardKeyReadWriteDistribution aggregate commands are sent out in step 3, the $match aggregate command against the config.analyzeShardKeySplitPoints collection can arrive with a gossiped clusterTime less than the afterClusterTime in the command itself. If at that point the node the $match aggregation is running on also hasn't replicated the inserts from step 2, then the command would fail here, and that would cause the analyzeShardKey command as a whole to fail.
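
      The failure mode above can be illustrated outside of analyzeShardKey with the hedged sketch below: a read whose afterClusterTime is ahead of the cluster time known to the node serving it fails up front rather than waiting for a time the node may never reach. The connection string, the collection name, and the one-hour offset are made-up values, and the sketch assumes a replica set or sharded cluster, since afterClusterTime is not accepted on a standalone mongod.

        # Illustrative sketch of the general error class (assumed names/values).
        from bson import Timestamp
        from pymongo import MongoClient
        from pymongo.errors import OperationFailure

        client = MongoClient("mongodb://localhost:27017")  # assumed cluster URI
        db = client["test"]

        # Observe the cluster time this client currently knows about.
        with client.start_session() as session:
            db.command("ping", session=session)
            current = session.cluster_time["clusterTime"]  # a bson Timestamp

        # Request a read at a cluster time far ahead of anything the node has.
        future = Timestamp(current.time + 3600, 0)
        try:
            db.command({
                "find": "someCollection",  # placeholder collection name
                "readConcern": {"level": "local", "afterClusterTime": future},
            })
        except OperationFailure as exc:
            # Expected: the node rejects an afterClusterTime greater than its
            # current cluster time instead of blocking indefinitely.
            print(exc)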

          People

            Assignee: Cheahuychou Mao (cheahuychou.mao@mongodb.com)
            Reporter: Cheahuychou Mao (cheahuychou.mao@mongodb.com)