Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Aggregation Framework
Labels:
- eng-l

Assigned Teams: Query Optimization
Backwards Compatibility: Fully Compatible
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

When ~~SERVER-19182~~ was implemented, we chose 5% of the collection as the cutoff at which we switch from the optimized $sampleFromRandomCursor implementation to the normal $sample implementation.

The $sampleFromRandomCursor implementation does repeated random walks over a tree (in currently supported storage engines), whereas the $sample implementation does a full collection scan, followed by a top-k sort on an injected random value.
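To make the contrast concrete, here is a minimal in-memory sketch of the two strategies in Python. It is an illustration only, not the server code: the function names are hypothetical, a plain list stands in for the collection, and a random index lookup stands in for a random tree walk.

```python
import heapq
import random


def sample_by_scan_and_top_k(docs, k, rng):
    """Full scan: give every document a random key, then keep the k largest keys."""
    keyed = ((rng.random(), i) for i in range(len(docs)))
    top = heapq.nlargest(k, keyed)
    return [docs[i] for _, i in top]


def sample_by_random_points(docs, k, rng):
    """Repeated random point lookups, skipping positions already returned."""
    if k > len(docs):
        raise ValueError("k must not exceed the number of documents")
    seen = set()
    out = []
    while len(out) < k:
        i = rng.randrange(len(docs))
        if i not in seen:  # a repeated position costs a lookup but yields nothing
            seen.add(i)
            out.append(docs[i])
    return out
```

The first function always touches every document regardless of k, while the second touches roughly k documents (plus retries on duplicates), which is why the relative cost depends on the sample size as a fraction of the collection.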

We expect the $sample implementation to be faster once the sample size exceeds some threshold percentage of the collection. A collection scan likely has a data access pattern of large sequential reads, whereas the random tree walks perform many random point accesses. Especially on spinning disks, the former becomes more appealing as the sample covers a larger and larger percentage of the collection.
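A deliberately simplistic cost model shows where such a threshold could come from. It assumes a fixed per-document cost for a sequential read and a fixed cost for a random point access, and it ignores caching, the top-k sort, and duplicate hits; the function name and both parameters are hypothetical.

```python
def break_even_fraction(sequential_cost_per_doc, random_cost_per_doc):
    """Sample fraction at which a full scan costs the same as random point accesses.

    Scan cost:   n * sequential_cost_per_doc
    Random cost: (fraction * n) * random_cost_per_doc
    Setting the two equal and solving for fraction gives the ratio below.
    """
    if sequential_cost_per_doc <= 0 or random_cost_per_doc <= 0:
        raise ValueError("costs must be positive")
    return min(1.0, sequential_cost_per_doc / random_cost_per_doc)
```

Under this model, a 5% cutoff corresponds to a random point access costing 20 times as much as reading one document sequentially. Whether that ratio holds on a given setup is exactly what the benchmarking would need to establish.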

We should do some benchmarking to see if 5% is a good cutoff for a variety of setups. It will likely depend on at least the following factors:

- Storage engine
- Type of disk
- Amount of memory
- Number of documents in the collection
- Size of documents
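The factors listed above could be turned into a benchmark grid along the following lines. This is only a sketch: every value in `FACTORS` and `SAMPLE_FRACTIONS` is a placeholder rather than a proposed test plan, and `crossover_fraction` assumes timings for both implementations have already been measured by some other means.

```python
from itertools import product

# Placeholder values for illustration; the real grid would need to be chosen.
FACTORS = {
    "storage_engine": ["wiredTiger"],
    "disk": ["hdd", "ssd"],
    "memory_gb": [4, 32],
    "num_docs": [10**5, 10**7],
    "doc_size_bytes": [256, 16384],
}
SAMPLE_FRACTIONS = [0.01, 0.02, 0.05, 0.10, 0.20]


def benchmark_configs(factors=FACTORS):
    """Return one dict per combination of factor values."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def crossover_fraction(timings):
    """Smallest measured fraction at which the scan is at least as fast.

    timings maps sample fraction -> (random_cursor_seconds, scan_seconds).
    Returns None if the random cursor wins at every measured fraction.
    """
    for fraction in sorted(timings):
        random_cursor_seconds, scan_seconds = timings[fraction]
        if scan_seconds <= random_cursor_seconds:
            return fraction
    return None
```

Comparing the crossover fraction across configurations would show how far a single cutoff is from the per-setup optimum.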

It may be very hard to find a single number that suits all combinations, but there may well be a better choice than 5%.

Assignee: [DO NOT USE] Backlog - Query Optimization
Reporter: Charlie Swanson
Participants: [DO NOT USE] Backlog - Query Optimization, Charlie Swanson, Thomas Rueckstiess
Votes: 0
Watchers: 8

Created: Feb 23 2016 04:31:52 PM UTC
Updated: Dec 06 2022 04:32:32 AM UTC
