[SERVER-22815] Investigate if there is a better cutoff for an optimized $sample than 5% of the collection Created: 23/Feb/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: eng-l
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Participants:

 Description   

When SERVER-19182 was implemented, we chose 5% as the cutoff at which we switch from the optimized $sampleFromRandomCursor implementation to the normal $sample implementation.

The $sampleFromRandomCursor implementation will do repeated random walks over a tree (in currently supported storage engines), whereas the $sample implementation will do a full collection scan, then a top-k sort based on an injected random value.
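For illustration, the top-k approach can be sketched client-side: scan everything, attach a random key to each document, and keep the k best keys. The following Python snippet is only a sketch of that idea under made-up data, not the server's actual implementation.

# Minimal sketch of the non-optimized strategy: full scan plus a top-k sort on
# an injected random value. Illustrative only; the real logic lives inside the
# server's aggregation pipeline, not in client-side Python.
import heapq
import random

def sample_by_topk(documents, k):
    """Pick k documents uniformly at random via a full scan and a top-k sort."""
    # Attach a random sort key to every document seen during the scan; the
    # index breaks (vanishingly unlikely) key ties so dicts are never compared.
    keyed = ((random.random(), i, doc) for i, doc in enumerate(documents))
    return [doc for _, _, doc in heapq.nlargest(k, keyed)]

# Example: sample 5 documents out of a 1,000-document "collection".
docs = [{"_id": i} for i in range(1000)]
print(sample_by_topk(docs, 5))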

It is thought that the $sample implementation will be faster above a certain threshold percentage. This is because a collection scan likely has a data access pattern of large sequential reads, whereas the random tree walks perform many random point accesses. Especially on spinning disks, the former becomes more appealing as the sampled fraction of the collection grows.

We should do some benchmarking to see if 5% is a good cutoff for a variety of setups. It will likely depend on at least the following factors:

  • Storage engine
  • Type of disk
  • Amount of memory
  • Number of documents in the collection
  • Size of documents

It may be very hard to find a number that is suited to all combinations, but it may be that there is a better choice than 5%.
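As a starting point, a benchmark harness could look something like the sketch below (Python with PyMongo). The connection string, database/collection names, and sample sizes are placeholder assumptions, and it only measures time to the first batch rather than a full drain of the cursor.

# Sketch of a $sample latency benchmark across sample sizes. Connection string,
# database/collection names, and the size range are placeholders; a fuller
# benchmark would repeat this across the factors listed above.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["test"]["sample_bench"]

def time_first_batch(size):
    """Return seconds until the first document of {$sample: {size: size}} arrives."""
    start = time.perf_counter()
    cursor = coll.aggregate([{"$sample": {"size": size}}], allowDiskUse=True)
    next(cursor, None)  # forces the server to produce the first batch
    return time.perf_counter() - start

for size in (1_000, 10_000, 50_000, 100_000):
    print(size, round(time_first_batch(size) * 1000), "ms")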



 Comments   
Comment by Thomas Rueckstiess [ 24/Feb/16 ]

I can share one data point, on WiredTiger, version 3.2.1, SSDs (MacBook Pro), all default settings:

I repeatedly $sampled for increasing numbers of documents (in steps of 1000) in a collection of 1M docs, and logged the time it took to return the first batch:

$sample size         time to first batch (ms)
45000                18
46000                19
47000                15
48000                19
49000                20
50000                22
51000                5821
52000                6165
53000                5962
54000                6114
55000                6068

As soon as we go above 5% of the collection (> 50k docs), the time to get the first batch jumps massively (roughly 250x), because the query switches to the collection scan with a top-k sort, which is a blocking stage. In effect, we have to wait for the full collection scan to complete before the first batch is returned. It also requires the "allowDiskUse: true" option for larger documents at this point, due to the sort stage.
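One way to check which plan the server picked on either side of the threshold is the aggregation explain output. The sketch below assumes placeholder connection and collection names, and the stage name it looks for ($sampleFromRandomCursor) may appear differently depending on server version, so treat the string check as illustrative.

# Sketch for checking which $sample plan was chosen just below and just above
# the 5% threshold of a 1M-document collection. Names are placeholders and the
# stage-name check is illustrative; explain output formats vary by version.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

for size in (50_000, 51_000):
    plan = db.command(
        "aggregate", "sample_bench",
        pipeline=[{"$sample": {"size": size}}],
        explain=True,
    )
    optimized = "$sampleFromRandomCursor" in str(plan)
    print(size, "random-cursor plan" if optimized else "collection scan + top-k sort")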
