Two things that are probably worth clarifying:
- The 5% threshold is not configurable. It is thought to be a good approximation of the cutoff beyond which scanning the entire collection is faster than performing that many random I/Os.
- If we are not under the 5% threshold, it's worth saying that we do a top-k sort (where k = sample size) on a generated random value. This top-k sort can spill to disk if the k documents exceed 100 MB in total, so allowDiskUse may be needed.
- Related to SERVER-72518: Make 5% random-cursor $sample cutoff configurable
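
To make the two points above concrete, here is a rough Python sketch of the decision as described in this comment. It is not the server's implementation: `plan_sample`, `build_aggregate_command`, and the average-document-size estimate are hypothetical helpers, 100 MB is assumed to mean 100 * 1024 * 1024 bytes, and any other conditions the server may check before choosing the random cursor are ignored.

```python
# Sketch only: models the behavior described in this comment, not server code.
SAMPLE_CUTOFF_PERCENT = 5                    # fixed threshold, not configurable
SORT_MEMORY_LIMIT_BYTES = 100 * 1024 * 1024  # assumed meaning of "100 MB"


def plan_sample(collection_count, sample_size, avg_doc_bytes):
    """Return (strategy, may_spill) for a hypothetical $sample request.

    avg_doc_bytes is a caller-supplied estimate (an assumption of this sketch).
    """
    # Integer arithmetic avoids floating-point rounding at the 5% boundary.
    if sample_size * 100 < SAMPLE_CUTOFF_PERCENT * collection_count:
        # Under the threshold: that many random I/Os beat a full scan.
        return ("random_cursor", False)
    # Otherwise: scan everything, then top-k sort (k = sample size)
    # on a generated random value. The k retained documents may not fit
    # in the in-memory sort budget.
    may_spill = sample_size * avg_doc_bytes > SORT_MEMORY_LIMIT_BYTES
    return ("scan_and_top_k_sort", may_spill)


def build_aggregate_command(collection_name, sample_size, allow_disk_use):
    """Build an aggregate command document containing a single $sample stage."""
    return {
        "aggregate": collection_name,
        "pipeline": [{"$sample": {"size": sample_size}}],
        "cursor": {},
        "allowDiskUse": allow_disk_use,
    }


if __name__ == "__main__":
    strategy, may_spill = plan_sample(1_000_000, 200_000, 1024)
    print(strategy, may_spill)
    print(build_aggregate_command("events", 200_000, may_spill))
```

In this sketch, `allowDiskUse` is only turned on when the estimated size of the k sampled documents exceeds the sort limit, which mirrors the caveat in the second bullet.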