[SERVER-75594] Make analyzeShardKey command only use one config collection to store split points Created: 03/Apr/23  Updated: 29/Oct/23  Resolved: 07/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.0-rc0

Type: Task Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Issue split
split from SERVER-75532 Investigate the high variability of t... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding NYC 2023-04-17
Participants:
Linked BF Score: 25

 Description   

Currently, the step for calculating the read and write distribution metrics in the analyzeShardKey command works as follows:

  1. The shard running the analyzeShardKey command generates the split points for the shard key being analyzed, and then persists them in a temporary config collection named config.analyzeShardKey.splitPoints.<collUuid>.<tempCollUuid>.
  2. The shard sends an aggregate command with the $_analyzeShardKeyReadWriteDistribution stage to all shards that own chunks for the collection. The stage has the following spec:

    {
         key: <object>,
         splitPointsNss: <namespace>,
         splitPointsAfterClusterTime: <timestamp>,
         splitPointsShardId: <shardId> // Only set when running on a sharded cluster
    }
    

  3. The shards running the $_analyzeShardKeyReadWriteDistribution stage then fetch all the split point documents from the collection 'splitPointsNss', build a synthetic routing table from them, and calculate the metrics based on that table.
  4. The shard from step 1 combines the metrics from all shards and drops the split point collection.
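The synthetic-routing-table step above can be sketched in a few lines of pure Python (no MongoDB required). This is an illustration, not the server implementation: split points sorted by shard key divide the key space into chunks, and each sampled read or write is routed to its chunk with a binary search. The function names and the single-field shard key are assumptions for the sketch.

```python
import bisect

def build_chunks(split_points):
    """N sorted split points define N+1 chunks:
    [min, sp1), [sp1, sp2), ..., [spN, max).
    A hypothetical simplification of the synthetic routing table."""
    return sorted(split_points)

def chunk_index(split_points, shard_key_value):
    """Route a shard key value to its chunk via binary search."""
    return bisect.bisect_right(split_points, shard_key_value)

# Example: 3 split points => 4 chunks.
split_points = build_chunks([25, 50, 75])
counts = [0] * (len(split_points) + 1)
for sampled_key in [3, 30, 30, 99, 75]:
    counts[chunk_index(split_points, sampled_key)] += 1
# 'counts' now reflects how the sampled reads/writes
# distribute across the synthetic chunks: [1, 2, 0, 2]
```

In the real stage, these per-chunk counts feed the read and write distribution metrics (e.g. how evenly operations spread across chunks under the candidate shard key).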

Giving each command its own split point collection is clean in one sense, but it has some notable downsides:

  • If an interrupt (e.g. due to shutdown) occurs in step 4, the temporary collection never gets dropped. This can be seen in BFG-1859782, BFG-1859158, BFG-1858992 and more.
  • Repeatedly creating and dropping collections is expensive. The command also currently needs to drop the collection before it can return.

One could set up a periodic job to drop collections with the "config.analyzeShardKey.splitPoints.*" prefix. However, it is hard to differentiate between dangling collections and collections that are actively being used by an in-progress analyzeShardKey command.

Given this, the analyzeShardKey command should instead use a single config collection for storing split points, and rely on a TTL index to automatically clean up the documents for commands that have already returned. That is, the stage should have the following spec, where 'splitPointsFilter' is a filter that matches only the split point documents generated by a particular command.

{
     key: <object>,
     splitPointsFilter: <object>,
     splitPointsAfterClusterTime: <timestamp>,
     splitPointsShardId: <shardId>
}

Users are unlikely to run many analyzeShardKey commands back to back, so there shouldn't be many documents to filter out during the read.
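The single-collection design can be sketched with a small pure-Python model (a list of dicts standing in for the config collection; the field names 'commandId', 'splitPoint', 'expireAt' and the 15-minute TTL are assumptions for illustration). Each command tags its split point documents with its own ID, readers apply the equivalent of 'splitPointsFilter' on that ID, and a TTL-style sweep removes documents whose expiry has passed.

```python
from datetime import datetime, timedelta

# One shared "collection" of split point documents.
split_points_coll = []

def insert_split_points(command_id, points, now, ttl=timedelta(minutes=15)):
    """Insert split points for one command, stamped with a TTL expiry."""
    for p in points:
        split_points_coll.append(
            {"commandId": command_id, "splitPoint": p, "expireAt": now + ttl}
        )

def read_split_points(command_id):
    """What 'splitPointsFilter' achieves: only this command's documents."""
    return [d["splitPoint"] for d in split_points_coll
            if d["commandId"] == command_id]

def ttl_sweep(now):
    """Emulates the TTL monitor: delete expired documents in place."""
    split_points_coll[:] = [d for d in split_points_coll
                            if d["expireAt"] > now]

t0 = datetime(2023, 4, 7)
insert_split_points("cmdA", [25, 50], t0)
insert_split_points("cmdB", [10], t0 + timedelta(minutes=10))
ttl_sweep(t0 + timedelta(minutes=20))  # cmdA's docs expire; cmdB's survive
```

The key property this models: no per-command create/drop is needed, and a dangling command's documents disappear on their own once the TTL elapses, regardless of whether step 4 ever ran.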



 Comments   
Comment by Githook User [ 07/Apr/23 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'}

Message: SERVER-75594 Make analyzeShardKey command only use one config collection to store split points
Branch: master
https://github.com/mongodb/mongo/commit/138c81833b5902fc9268b56992231c2480a7a270

Comment by Adi Zaimi [ 07/Apr/23 ]

Thanks, I went back to the design doc to get more clarity as well:

  • The shard that runs the analyzeShardKey command will write the split points to a local temporary collection. After that, it will send an aggregation command with the new $_analyzeShardKeyReadWriteDistribution stage to all the shards. 
  • This aggregation stage will cause each shard to run a find command and getMore commands (as needed) against the temporary collection on that shard to get the split points, build a synthetic routing table from them, and do the metrics calculation based on it.
Comment by Adi Zaimi [ 07/Apr/23 ]

Right, so the documents have a TTL of 15 minutes; I don't understand why that is enough time. If you can remind me please: are the split points created by an analyzeShardKey command needed only briefly, and not for the whole time we are analyzing the key?

Comment by Adi Zaimi [ 07/Apr/23 ]

Some questions when reading the above:
 - How do we ensure that the documents stay in the collection during the time that we need them?

 - Cleanup to drop the unified collection will be left to the user to do manually?

Generated at Thu Feb 08 06:30:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.