Core Server / SERVER-22068

Add an option to $sample to perform a more statistically unbiased sample

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Aggregation Framework
    • Query Optimization

      The $sample stage currently has two algorithms to select a random sample:

      1. A random cursor, which does a random walk over a B-tree-like structure.
      2. A full collection scan, with the results sorted by a random value.
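As a rough illustration of the second strategy, the sketch below tags every scanned document with an independent random key and keeps the documents with the largest keys, which is equivalent to sorting by the key and taking a prefix. It is a minimal Python model, not the server's implementation; the function name `sample_by_random_sort` and the `rng` parameter are hypothetical.

```python
import heapq
import random

def sample_by_random_sort(docs, size, rng=None):
    """Return up to `size` documents chosen uniformly without replacement.

    Hypothetical helper for illustration only; not the server's code.
    """
    rng = rng or random.Random()
    # Full scan: tag every document with an independent random key.
    # The index breaks ties so that documents are never compared directly.
    keyed = ((rng.random(), i, doc) for i, doc in enumerate(docs))
    # Keeping the `size` largest keys equals sorting by key and taking a prefix.
    top = heapq.nlargest(size, keyed)
    return [doc for _, _, doc in top]

docs = [{"_id": i} for i in range(1000)]
picked = sample_by_random_sort(docs, 5, random.Random(42))
print(len(picked))
```

Every document is visited exactly once, so the quality of the sample depends only on the quality of the random keys.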

      The second strategy yields a better statistical distribution, since it relies only on the random number generator and does not depend on any tree being balanced. It also weights results from shards holding different amounts of data more accurately. The random-walk approach has special logic to approximate per-shard weighting, but that logic is flawed because it only has an estimate of the number of documents each shard owns.
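To see why a random sort weights shards correctly without knowing their document counts, consider the sketch below. It assumes each shard returns its own documents with the largest random keys and that a merge step keeps the globally largest keys; this is a simplified model of the idea rather than the server's actual merge logic, and `shard_top_k` and `merge_shard_samples` are hypothetical names.

```python
import heapq
import random

def shard_top_k(docs, size, rng):
    """One shard: key each of its documents and return the `size` largest keys."""
    keyed = ((rng.random(), doc) for doc in docs)
    return heapq.nlargest(size, keyed, key=lambda pair: pair[0])

def merge_shard_samples(per_shard, size):
    """Merge step: keep the globally largest keys across all shard results."""
    candidates = (pair for shard in per_shard for pair in shard)
    merged = heapq.nlargest(size, candidates, key=lambda pair: pair[0])
    return [doc for _, doc in merged]

rng = random.Random(2024)
big = [("big", i) for i in range(9000)]      # shard holding 90% of the data
small = [("small", i) for i in range(1000)]  # shard holding 10% of the data
sample = merge_shard_samples(
    [shard_top_k(big, 100, rng), shard_top_k(small, 100, rng)], 100
)
print(sum(1 for name, _ in sample if name == "big"), "of 100 came from the big shard")
```

Because the globally largest keys are always among the per-shard largest keys, the merged result is the same as a random sort over the union of all shards. In expectation, each shard contributes in proportion to its true size, and no estimate of per-shard document counts is needed.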

      We should add an option to the $sample stage that forces it to use the scan + random sort approach. When this option is passed, the stage should probably use the better random number generator as well.

            backlog-query-optimization [DO NOT USE] Backlog - Query Optimization
            charlie.swanson@mongodb.com Charlie Swanson