- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
When running analyzeShardKey with keyCharacteristics: true, both the monotonicity pass and the cardinality/frequency pass perform a full sequential scan of the supporting index regardless of sampleSize. The sampling parameters only control how many results are kept, not how many index keys are read. For large collections (1B+ documents) with a small sampleSize (e.g. 1M), execution time is dominated by scanning ~1B index keys twice, even though only 1M values are ultimately used.
Both calculateMonotonicity and the cardinality/frequency aggregation scan nearly the entire supporting index even when sampleSize << collectionSize. For sampleSize=1M on a 1B-document collection, each pass iterates over ~1B index keys to probabilistically collect ~1M samples (sampling rate = 1M/1B = 0.001).
- Monotonicity: exec->getNext() advances the cursor on every iteration, even when shouldSample is false: https://github.com/10gen/mongo/blob/master/src/mongo/db/s/analyze_shard_key_cmd_util.cpp#L804-L806
- Cardinality: the $sampleRate filter is evaluated against every document that enters the pipeline: https://github.com/10gen/mongo/blob/master/src/mongo/db/s/analyze_shard_key_cmd_util.cpp#L222-L225
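To make the cost difference concrete, here is a minimal Python sketch of the two sampling strategies. It is not the server code and does not propose a specific fix. The function names `bernoulli_scan` and `skip_sample` are hypothetical, and the second function assumes a cursor that can jump ahead by an arbitrary number of keys cheaply, which the existing index cursor may not support. The first function mirrors the behavior described above (read every key, keep each with probability `rate`). The second draws geometric gaps between sampled positions, so the number of keys read equals the number of samples.

```python
import math
import random


def bernoulli_scan(num_keys, rate, rng):
    """Read every key and keep each one with probability `rate`.

    Mirrors the behavior described in this ticket: work is O(num_keys).
    """
    keys_read = 0
    sampled = []
    for key in range(num_keys):
        keys_read += 1  # the cursor advances even when the key is discarded
        if rng.random() < rate:
            sampled.append(key)
    return sampled, keys_read


def skip_sample(num_keys, rate, rng):
    """Jump between sampled positions using geometric gaps.

    ASSUMPTION: a cursor that can skip ahead cheaply. Work is
    O(number of samples), i.e. roughly num_keys * rate.
    """
    assert 0.0 < rate < 1.0
    keys_read = 0
    sampled = []
    pos = -1
    log_q = math.log(1.0 - rate)
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1]
        gap = int(math.log(u) / log_q) + 1  # geometric gap, always >= 1
        pos += gap
        if pos >= num_keys:
            break
        keys_read += 1  # only sampled keys are actually read
        sampled.append(pos)
    return sampled, keys_read


if __name__ == "__main__":
    rng = random.Random(42)
    n, rate = 200_000, 0.001
    full_sample, full_reads = bernoulli_scan(n, rate, rng)
    skip_sampled, skip_reads = skip_sample(n, rate, rng)
    print("full scan:", len(full_sample), "samples,", full_reads, "keys read")
    print("skip scan:", len(skip_sampled), "samples,", skip_reads, "keys read")
```

Both functions select each key with the same probability, but only the second has a read count that scales with the expected sample size rather than with the collection size.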
We need this improvement for the sharding key advisor epic (https://jira.mongodb.org/browse/CLOUDP-376926), since analyzeShardKey will likely not perform acceptably on very large customer collections. The desired end state is that analyzeShardKey runtime is proportional to the sampleSize parameter, not to the collection size.