When SERVER-19182 was implemented, we chose 5% of the collection as the cutoff at which we switch from the optimized $sampleFromRandomCursor to the normal $sample implementation.
The $sampleFromRandomCursor implementation will do repeated random walks over a tree (in currently supported storage engines), whereas the $sample implementation will do a full collection scan, then a top-k sort based on an injected random value.
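To make the difference concrete, here is a minimal Python sketch of the two strategies over an in-memory list. It is not the server code: `sample_top_k` and `sample_random_points` are hypothetical names, a list index stands in for one random tree walk, and skipping already-returned documents is an assumption of this sketch.

```python
import heapq
import random


def sample_top_k(docs, k, rng):
    """Full scan, then top-k selection on an injected random key."""
    # Tag every document with a random value while scanning the whole list once.
    # The index i breaks ties so the documents themselves are never compared.
    keyed = ((rng.random(), i, doc) for i, doc in enumerate(docs))
    # Keep only the k entries with the smallest random keys.
    return [doc for _, _, doc in heapq.nsmallest(k, keyed)]


def sample_random_points(docs, k, rng):
    """Repeated random point accesses (stand-in for random tree walks)."""
    if k > len(docs):
        raise ValueError("sample size exceeds collection size")
    seen = set()
    out = []
    while len(out) < k:
        i = rng.randrange(len(docs))  # one random point access
        if i not in seen:  # assumption: duplicates are skipped
            seen.add(i)
            out.append(docs[i])
    return out
```

The first function always touches every document, regardless of k. The second touches roughly k documents, but each access lands at an unpredictable position.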
We expect the $sample implementation to be faster once the requested sample exceeds some threshold percentage of the collection. A collection scan likely has a data access pattern of large sequential reads, whereas the random tree walks perform many random point accesses. Especially on spinning disks, the sequential scan becomes more appealing as the sample covers a larger and larger percentage of the collection.
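The trade-off can be illustrated with a toy cost model. This is only a sketch under strong assumptions: the per-document costs are hypothetical inputs in arbitrary units, not measurements, and the model ignores the top-k sort and any caching effects. The function names are invented for illustration.

```python
def break_even_fraction(seq_cost_per_doc, random_cost_per_doc):
    """Fraction of the collection at which k random accesses cost as much as one full scan."""
    if seq_cost_per_doc <= 0 or random_cost_per_doc <= 0:
        raise ValueError("costs must be positive")
    # k * random_cost == n * seq_cost  =>  k / n == seq_cost / random_cost
    return min(1.0, seq_cost_per_doc / random_cost_per_doc)


def cheaper_strategy(n_docs, sample_size, seq_cost_per_doc, random_cost_per_doc):
    """Return which strategy the toy model predicts to be cheaper."""
    scan_cost = n_docs * seq_cost_per_doc  # every document is read once
    walk_cost = sample_size * random_cost_per_doc  # one point access per sampled document
    return "random_walk" if walk_cost < scan_cost else "full_scan"
```

Under this model, a 5% cutoff corresponds to a random access costing 20 times as much as a sequential read of one document. Whether that ratio holds on a given setup is exactly what benchmarking would need to establish.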
We should do some benchmarking to see if 5% is a good cutoff for a variety of setups. It will likely depend on at least the following factors:
- Storage engine
- Type of disk
- Amount of memory
- Number of documents in the collection
- Size of documents
It may be very hard to find a single number that suits all combinations, but benchmarking may still show that there is a better choice than 5%.