[SERVER-75806] analyzeShardKey command can fail with readConcern afterClusterTime error Created: 06/Apr/23  Updated: 29/Oct/23  Resolved: 10/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding NYC 2023-04-17
Participants:
Linked BF Score: 135

 Description   

The step for calculating the read and write distribution metrics in the analyzeShardKey command works as follows:

  1. The client runs the analyzeShardKey command against some mongos. That mongos forwards the command to a random shard with readPreference "secondaryPreferred".
  2. The shardsvr node running the analyzeShardKey command generates the split points for the shard key being analyzed, and then persists them in the config.analyzeShardKeySplitPoints collection. If the node isn't a primary, it would run the insert command against the primary node using ScopedClientConnection (here). 
  3. The shardsvr node then sends an aggregate command with the $_analyzeShardKeyReadWriteDistribution stage to all shards that own chunks for the collection. The stage has the following spec, where 'splitPointsAfterClusterTime' is set to the operationTime in the response for insert command in step 2, and 'splitPointsShardId' set to its own shard id.

    {
         key: <object>,
         splitPointsFilter: <object>
         splitPointsAfterClusterTime: <timestamp>,
         splitPointsShardId: <shardId>
    }
    

  4. Each shard running the $_analyzeShardKeyReadWriteDistribution stage then runs an aggregate command against the shard 'splitPointsShardId' to fetch all split point documents. The aggregate command has a $match stage with filter set to 'splitPointsFilter' and readConcern with afterClusterTime set to 'splitPointsAfterClusterTime'. Using the split ponits, each shard then creates a synthetic routing table and does metrics calculation. 
  5. The shardsvr node from step 1 combines the metrics from all the shards.

It turns out the use of ScopedClientConnection can result in some causal consistency issue in the case where the analyzeShardKey targets a secondary node, since ScopedClientConnection doesn't internally have a VectorClockMetadataHook (e.g. like here) which would read the clusterTime of every reply and advance the VectorClock. That is, if the inserts do not replicated to the node running the analyzeShardKey before the $_analyzeShardKeyReadWriteDistribution aggregate commands are sent out in step 3, the $match aggregate command against the config.analyzeShardKeySplitPoints collection can arrive with clusterTime less than the afterClusterTime in the command itself. If at that point the node the $match aggregation is running also hasn't replicated the inserts in step 2, then the command would fail here and this would cause the analyzeShardKey as a whole to fail. 



 Comments   
Comment by Githook User [ 10/Apr/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-75806 Make the analyzeShardKey machinery run commands via the ReplicaSetNodeProcessInterface executor or Grid fixed executor
Branch: master
https://github.com/mongodb/mongo/commit/b48396832c7c1fa5c9b903b86852e63822cb17ef

Generated at Thu Feb 08 06:31:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.