Core Server / SERVER-75806

analyzeShardKey command can fail with readConcern afterClusterTime error


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version: 7.0.0-rc0
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding NYC 2023-04-17

    Description

      The step for calculating the read and write distribution metrics in the analyzeShardKey command works as follows:

      1. The client runs the analyzeShardKey command against a mongos. That mongos forwards the command to a random shard with readPreference "secondaryPreferred".
      2. The shardsvr node running the analyzeShardKey command generates the split points for the shard key being analyzed, and then persists them in the config.analyzeShardKeySplitPoints collection. If the node is not a primary, it runs the insert command against the primary node using ScopedClientConnection (here).
      3. The shardsvr node then sends an aggregate command with the $_analyzeShardKeyReadWriteDistribution stage to all shards that own chunks for the collection. The stage has the following spec, where 'splitPointsAfterClusterTime' is set to the operationTime in the response to the insert command in step 2, and 'splitPointsShardId' is set to its own shard id.

        {
             key: <object>,
             splitPointsFilter: <object>,
             splitPointsAfterClusterTime: <timestamp>,
             splitPointsShardId: <shardId>
        }
        

      4. Each shard running the $_analyzeShardKeyReadWriteDistribution stage then runs an aggregate command against the shard 'splitPointsShardId' to fetch all the split point documents. The aggregate command has a $match stage with the filter set to 'splitPointsFilter' and a readConcern with afterClusterTime set to 'splitPointsAfterClusterTime'. Using the split points, each shard then builds a synthetic routing table and performs the metrics calculation (a rough driver-level sketch of steps 2 and 4 appears after this list).
      5. The shardsvr node running the analyzeShardKey command then combines the metrics returned by all the shards.
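
      As a rough, non-authoritative sketch of what steps 2 and 4 amount to from a driver's point of view, the snippet below issues an insert into config.analyzeShardKeySplitPoints, takes the operationTime from the reply, and then runs an aggregate on that collection whose readConcern carries that operationTime as afterClusterTime. The connection string, the placeholder split point documents and filter, and the "local" read concern level are assumptions for illustration; the server's internal requests are not literally built this way.

        # Illustrative sketch only (assumed names and values), not the server's
        # internal code path.
        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")  # assumed cluster URI
        config = client["config"]

        # Step 2 (simplified): persist the generated split points and capture
        # the operationTime from the insert reply.
        split_point_docs = [{"splitPoint": {"x": 0}}]  # placeholder split points
        insert_reply = config.command({
            "insert": "analyzeShardKeySplitPoints",
            "documents": split_point_docs,
        })
        split_points_after_cluster_time = insert_reply["operationTime"]

        # Step 4 (simplified): fetch the split points back with a causally
        # consistent read that must not see state older than the insert above.
        split_points_filter = {}  # placeholder for 'splitPointsFilter'
        config.command({
            "aggregate": "analyzeShardKeySplitPoints",
            "pipeline": [{"$match": split_points_filter}],
            "cursor": {},
            "readConcern": {
                "level": "local",
                "afterClusterTime": split_points_after_cluster_time,
            },
        })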

      It turns out the use of ScopedClientConnection can result in a causal consistency issue when the analyzeShardKey command targets a secondary node, since ScopedClientConnection does not internally have a VectorClockMetadataHook (e.g. like here) that reads the clusterTime of every reply and advances the VectorClock. That is, if the inserts have not replicated to the node running the analyzeShardKey command before the $_analyzeShardKeyReadWriteDistribution aggregate commands are sent out in step 3, the $match aggregate command against the config.analyzeShardKeySplitPoints collection can arrive with a gossiped clusterTime less than the afterClusterTime in the command itself. If at that point the node the $match aggregation is running on also hasn't replicated the inserts from step 2, then the command would fail here, and that would cause the analyzeShardKey command as a whole to fail.
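
      The failure mode above can be illustrated outside of analyzeShardKey with the hedged sketch below: a read whose afterClusterTime is ahead of the cluster time known to the node serving it fails up front rather than waiting for a time the node may never reach. The connection string, the collection name, and the one-hour offset are made-up values, and the sketch assumes a replica set or sharded cluster, since afterClusterTime is not accepted on a standalone mongod.

        # Illustrative sketch of the general error class (assumed names/values).
        from bson import Timestamp
        from pymongo import MongoClient
        from pymongo.errors import OperationFailure

        client = MongoClient("mongodb://localhost:27017")  # assumed cluster URI
        db = client["test"]

        # Observe the cluster time this client currently knows about.
        with client.start_session() as session:
            db.command("ping", session=session)
            current = session.cluster_time["clusterTime"]  # a bson Timestamp

        # Request a read at a cluster time far ahead of anything the node has.
        future = Timestamp(current.time + 3600, 0)
        try:
            db.command({
                "find": "someCollection",  # placeholder collection name
                "readConcern": {"level": "local", "afterClusterTime": future},
            })
        except OperationFailure as exc:
            # Expected: the node rejects an afterClusterTime greater than its
            # current cluster time instead of blocking indefinitely.
            print(exc)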

          People

            Assignee: Cheahuychou Mao (cheahuychou.mao@mongodb.com)
            Reporter: Cheahuychou Mao (cheahuychou.mao@mongodb.com)