[SERVER-22815] Investigate if there is a better cutoff for an optimized $sample than 5% of the collection Created: 23/Feb/16 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | eng-l | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Query Optimization |
| Backwards Compatibility: | Fully Compatible |
| Participants: |
| Description |
|
The $sampleFromRandomCursor implementation does repeated random walks over a tree (in currently supported storage engines), whereas the $sample implementation does a full collection scan followed by a top-k sort on an injected random value. It is thought that the $sample implementation becomes faster above a certain threshold percentage, because a collection scan has a data access pattern of large sequential reads, while the random tree walks perform many random point accesses. Especially on spinning disks, the former becomes more appealing as a larger and larger percentage of the collection is sampled. We should do some benchmarking to see whether 5% is a good cutoff across a variety of setups. The right cutoff will likely depend on at least a few factors, such as the storage engine and the type of disk.
It may be very hard to find a single number that suits all combinations, but there may well be a better choice than 5%. |
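A minimal mongosh sketch of the two regimes described above (the collection name coll and the 1M-document collection size are assumptions for illustration, not part of this ticket):

```javascript
// Assumes a collection db.coll with roughly 1,000,000 documents.
// Below the current 5% cutoff, $sample can use the optimized random-cursor
// path ($sampleFromRandomCursor); above it, it falls back to a full
// collection scan plus a top-k sort on an injected random value.

db.coll.aggregate([{ $sample: { size: 40000 } }]);   // 4% of 1M: random-walk path expected

db.coll.aggregate(
  [{ $sample: { size: 60000 } }],                    // 6% of 1M: collection scan + top-k sort
  { allowDiskUse: true }                             // the sort stage may need to spill to disk
);
```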
| Comments |
| Comment by Thomas Rueckstiess [ 24/Feb/16 ] |
|
I can share one data point: WiredTiger, version 3.2.1, SSDs (MacBook Pro), all default settings. I repeatedly ran $sample for increasing sample sizes (in steps of 1000) on a collection of 1M documents and logged the time it took to return the first batch.
As soon as we go above 5% of the collection (> 50k docs), the time to get the first batch jumps massively (around 250x), because the query switches to the blocking collection scan with a top-k sort. This effectively means we have to wait for the full collection scan to complete before the first batch is returned. At that point it also requires the "allowDiskUse: true" option for larger documents because of the sort stage. |
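For reference, a rough mongosh sketch of the kind of measurement loop described in this comment (the collection name, size range, and step are assumptions; exact timings will vary by setup):

```javascript
// Time how long $sample takes to return its first batch for increasing
// sample sizes on a ~1M-document collection db.coll.
for (let size = 1000; size <= 100000; size += 1000) {
  const start = Date.now();
  const cursor = db.coll.aggregate(
    [{ $sample: { size: size } }],
    { allowDiskUse: true }   // needed once the blocking sort path kicks in
  );
  cursor.hasNext();          // ensure the first batch has actually arrived
  print(size + "\t" + (Date.now() - start) + " ms");
  cursor.close();
}
```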