The $sample stage currently has two algorithms to select a random sample:
- Using a random cursor, which does a random walk over a B-tree-like structure.
- A full collection scan, sorting by a random value.
The latter strategy yields a better statistical distribution because it relies only on the random number generator and does not depend on any tree being balanced. It also weights results from shards holding different amounts of data more accurately. The random walk approach has some special logic to approximate per-shard weighting, but that logic is flawed because each shard only has an estimate of the number of documents it owns.
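To illustrate why the scan + random sort approach needs no per-shard weighting logic, here is a minimal Python sketch. It is not the server's implementation: the function names, the in-memory "shards", and the use of Python's `random` module are all assumptions made for the example. Each document gets an independent random key, each shard returns its `size` smallest keys, and the merger keeps the `size` smallest keys overall. Because the global smallest keys are always among the per-shard smallest keys, the merged result is the same as sampling from one combined collection, so larger shards contribute proportionally more documents without anyone needing to know document counts.

```python
import heapq
import random
from operator import itemgetter


def sample_by_random_sort(docs, size, rng):
    """Scan every document, attach a random key, keep the `size` smallest keys.

    Returns (key, doc) pairs so that a merger can compare keys across shards.
    """
    keyed = ((rng.random(), doc) for doc in docs)
    return heapq.nsmallest(size, keyed, key=itemgetter(0))


def merge_shard_samples(shard_samples, size):
    """Merge per-shard (key, doc) pairs and keep the global `size` smallest keys."""
    all_items = (item for sample in shard_samples for item in sample)
    merged = heapq.nsmallest(size, all_items, key=itemgetter(0))
    return [doc for _, doc in merged]


# Hypothetical data: shard A owns 9x as many documents as shard B.
rng = random.Random(42)
big_shard = [{"shard": "A", "n": i} for i in range(9000)]
small_shard = [{"shard": "B", "n": i} for i in range(1000)]

per_shard = [sample_by_random_sort(s, 100, rng) for s in (big_shard, small_shard)]
result = merge_shard_samples(per_shard, 100)

# Count how many sampled documents came from the larger shard.
from_a = sum(1 for d in result if d["shard"] == "A")
print(f"{from_a} of {len(result)} sampled documents came from shard A")
```

The quality of the sample here depends only on the random keys, which mirrors the point above: there is no tree shape and no document-count estimate involved.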
We should add an option to the $sample stage that forces it to use the scan + random sort approach. When this option is passed, the stage should probably use the better random number generator as well.