  1. Core Server
  2. SERVER-22068

Add an option to $sample to perform a more statistically unbiased sample

      Description

      The $sample stage currently has two algorithms to select a random sample:

      1. Using a random cursor, which does a random walk over a B-tree-like structure.
      2. A full collection scan, sorting by a random value.

      The latter strategy produces a statistically less biased sample, since it relies only on the random number generator and does not depend on any tree being balanced. It is also better at weighting results from shards that hold different amounts of data. The random walk approach has some special logic to approximate per-shard weighting, but that logic is flawed because it only has an estimate of the number of documents each shard owns.
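
      As a rough illustration of the scan + random sort strategy (a minimal Python sketch, not the server's implementation): each document is tagged with an independent random key during a full scan, and the sample is the k documents with the smallest keys. The helper name `sample_by_random_sort` and its parameters are hypothetical.

```python
import heapq
import random


def sample_by_random_sort(docs, k, rng=None):
    """Return up to k documents chosen uniformly, without replacement.

    Hypothetical helper: scans every document, attaches a random key,
    and keeps the k smallest keys (equivalent to a top-k random sort).
    """
    if rng is None:
        rng = random.Random()
    # (key, position, doc): the unique position breaks ties, so the
    # documents themselves are never compared.
    keyed = ((rng.random(), i, doc) for i, doc in enumerate(docs))
    return [doc for _, _, doc in heapq.nsmallest(k, keyed)]
```

      The result depends only on the quality of the random keys, not on how the underlying storage is laid out, which is the property the description attributes to this strategy.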

      We should add an option to the $sample stage to force it to perform the scan + random sort approach. When this option is passed, it should probably use the better random number generator as well.
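
      To illustrate why forcing the scan + random sort path matters on a sharded cluster, here is a toy Python model of the weighting problem. It assumes the sample is split across shards in proportion to each shard's document count; the server's actual weighting logic is not described in this ticket, and the shard names and counts below are invented.

```python
def shard_shares(counts):
    """Fraction of the total document count attributed to each shard."""
    total = sum(counts.values())
    return {shard: n / total for shard, n in counts.items()}


def inclusion_probability(sample_size, weights, true_counts):
    """Chance that one given document on each shard lands in the sample,
    if the sample is split across shards in proportion to `weights`."""
    shares = shard_shares(weights)
    return {
        shard: sample_size * shares[shard] / true_counts[shard]
        for shard in true_counts
    }


# Invented numbers for illustration only.
true_counts = {"shardA": 900, "shardB": 100}
estimated_counts = {"shardA": 600, "shardB": 400}  # hypothetical bad estimate

exact = inclusion_probability(10, true_counts, true_counts)
skewed = inclusion_probability(10, estimated_counts, true_counts)
```

      With these made-up counts, weighting by the true counts gives every document the same 1% chance of being sampled, while weighting by the estimate gives documents on shardB a 4% chance and documents on shardA roughly 0.67%.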

            People

            Assignee:
            backlog-query-optimization Backlog - Query Optimization
            Reporter:
            charlie.swanson Charlie Swanson
            Votes:
            1
            Watchers:
            7
