analyzeShardKey: make index scan proportional to sampleSize, not collection size

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cluster Scalability

      When running analyzeShardKey with keyCharacteristics: true, both the monotonicity pass and the cardinality/frequency pass perform a full sequential scan of the supporting index regardless of sampleSize. The sampling parameters only control how many results are kept, not how many index keys are read. For large collections (1B+ documents) with a small sampleSize (e.g. 1M), execution time is dominated by scanning ~1B index keys twice, even though only 1M values are ultimately used.
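      The read-vs-keep distinction can be sketched as follows. This is a purely illustrative Python model of the Bernoulli-sampling pattern described above, not the server's actual code; the names are made up for the example:

```python
import random

def bernoulli_sample(index_keys, sample_size):
    """Model of the current behavior: scan EVERY index key, keeping
    each one with probability sample_size / len(index_keys)."""
    rate = sample_size / len(index_keys)
    keys_read = 0
    sample = []
    for key in index_keys:
        keys_read += 1          # every key is read...
        if random.random() < rate:
            sample.append(key)  # ...but only ~sample_size are kept
    return sample, keys_read

# 1M-key "index" standing in for the 1B-key case in the description.
keys = list(range(1_000_000))
sample, keys_read = bernoulli_sample(keys, sample_size=1_000)
print(keys_read)    # 1_000_000: reads scale with collection size
print(len(sample))  # ~1_000: only the kept values scale with sampleSize
```

      The point of the sketch: keys_read equals the index size no matter how small sample_size is, which is why runtime tracks collection size rather than sampleSize.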

      Both calculateMonotonicity and the cardinality/frequency aggregation scan nearly the entire supporting index even when sampleSize << collection size. For sampleSize = 1M on a 1B-document collection, each pass iterates ~1B index keys to probabilistically collect 1M samples (sampling rate = 1M / 1B = 0.001).
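      By contrast, a sampleSize-proportional approach would read only O(sampleSize) keys. The sketch below is one hypothetical shape of such a fix, not the server's planned implementation; the seek callback is an assumed positional index lookup invented for the example:

```python
import random

def position_sample(index_size, sample_size, seek):
    """Choose sample_size random index positions up front, then seek
    directly to each, so keys read is O(sample_size), not O(index_size).
    `seek(pos)` is a hypothetical positional lookup into the index."""
    positions = sorted(random.sample(range(index_size), sample_size))
    return [seek(pos) for pos in positions]  # sample_size reads total

keys = list(range(1_000_000))
sample = position_sample(len(keys), 1_000, seek=lambda i: keys[i])
print(len(sample))  # exactly 1_000 keys read, regardless of index size
```

      With this shape, doubling the collection size leaves the number of index reads unchanged, matching the requested behavior of runtime proportional to sampleSize.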

      We need this improvement for the sharding key advisor epic (https://jira.mongodb.org/browse/CLOUDP-376926): the current behavior will likely not perform acceptably on very large customer collections. The desired end state is that analyzeShardKey runtime is proportional to the sampleSize parameter, not the collection size.

            Assignee:
            Unassigned
            Reporter:
            Alex Dambrouski
            Votes:
            0
            Watchers:
            1
