[SERVER-22068] Add an option to $sample to perform a more statistically unbiased sample Created: 05/Jan/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: expression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Query Optimization
Participants:

 Description   

The $sample stage currently has two algorithms to select a random sample:

  1. Using a random cursor, which does a random walk over some B-tree-like structure.
  2. Doing a full collection scan and sorting by a random value.
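
As a rough illustration of the second strategy, the following Python sketch tags every document with a random key and keeps the n smallest keys, which is equivalent to sorting by the key and taking the first n. It is not the server's implementation; the function name `sample_by_random_sort` and its parameters are hypothetical and chosen only for this example.

```python
import heapq
import random


def sample_by_random_sort(docs, n, rng=None):
    """Pick n documents uniformly at random, without replacement.

    Hypothetical helper: illustrates "full scan + sort by random value",
    not the actual $sample code path.
    """
    rng = rng or random.Random()
    # Full scan: give every document an independent random key.
    # The index i breaks ties so that documents themselves are never compared.
    keyed = ((rng.random(), i, doc) for i, doc in enumerate(docs))
    # Keeping the n smallest keys matches sorting by key and taking the first n.
    smallest = heapq.nsmallest(n, keyed)
    return [doc for _, _, doc in smallest]


docs = [{"_id": i} for i in range(100)]
picked = sample_by_random_sort(docs, 10, random.Random(1))
```

Every document gets the same chance of selection because the outcome depends only on the random keys, not on how the documents are stored.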

The latter strategy produces a more uniform sample, since it relies only on the random number generator and does not depend on any tree being balanced. It also weights results from shards holding different amounts of data more accurately. The random walk approach has special logic to approximate per-shard weighting, but that logic is flawed because it only has an estimate of the number of documents each shard owns.
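
To make the weighting problem concrete, here is a small Python calculation. It assumes, purely for illustration, that the random walk approach draws from each shard in proportion to that shard's estimated document count; the ticket does not spell out the actual weighting logic. The function name `per_document_weight` and the shard counts are hypothetical.

```python
def per_document_weight(actual, estimated):
    """Relative chance that a single document on each shard is picked when
    shards are drawn in proportion to `estimated` counts but really hold
    `actual` counts. A value of 1.0 means no bias.

    Hypothetical helper, based on the assumption stated above.
    """
    est_total = sum(estimated.values())
    act_total = sum(actual.values())
    return {
        shard: (estimated[shard] / est_total) / (actual[shard] / act_total)
        for shard in actual
    }


# Made-up numbers: shard A's count is overestimated, shard B's is underestimated.
actual = {"A": 1000, "B": 3000}
estimated = {"A": 2000, "B": 2000}
weights = per_document_weight(actual, estimated)
# weights["A"] is 2.0: documents on A are twice as likely to be picked as they should be.
# weights["B"] is about 0.67: documents on B are under-represented.
```

Under the scan and random sort approach no such estimate is needed, because each shard's contribution to the merged result follows from how many documents it actually scanned.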

We should add an option to the $sample stage to force it to perform the scan + random sort approach. When this option is passed, it should probably use the better random number generator as well.

