Type: New Feature
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Aggregation Framework
Labels:
- expression

Assigned Teams: Query Optimization
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The $sample stage currently has two algorithms to select a random sample:

- Using a random cursor (a random walk over some B-tree-like structure).
- A full collection scan, sorting by a random value.

The latter strategy has a better statistical distribution, since it relies only on the random number generator and doesn't depend on any trees being balanced. It also weights results properly across shards that hold different amounts of data. The random walk approach has some special logic to approximate per-shard weighting, but that logic is flawed because it only has an estimate of the number of documents owned by each shard.
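To illustrate why the scan + random sort strategy weights shards correctly, here is a minimal Python sketch. It is not the server's implementation: the function names `sample_by_random_sort` and `sample_sharded` are hypothetical, shards are modeled as plain lists, and Python's `random.Random` stands in for whatever generator the server uses. Every document gets an independent random key, each shard keeps its top `size` keys, and the merge step keeps the overall top `size`. No document counts or estimates are needed.

```python
import heapq
import random


def sample_by_random_sort(docs, size, rng):
    """Scan every document, tag it with a random key, keep the `size` largest keys."""
    # The index `i` breaks ties so the documents themselves are never compared.
    keyed = ((rng.random(), i, doc) for i, doc in enumerate(docs))
    # Equivalent to sorting all documents by the random key and taking the top `size`.
    return heapq.nlargest(size, keyed)


def sample_sharded(shards, size, rng):
    """Each shard samples locally; the merger re-ranks by the same random keys."""
    partials = []
    for shard_id, docs in enumerate(shards):
        for key, i, doc in sample_by_random_sort(docs, size, rng):
            partials.append((key, shard_id, i, doc))
    # A larger shard contributes more high keys simply because it has more documents.
    return [doc for _key, _shard, _i, doc in heapq.nlargest(size, partials)]


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical cluster: one shard with 900 documents, one with 100.
    shards = [[("big", n) for n in range(900)], [("small", n) for n in range(100)]]
    print(sample_sharded(shards, 10, rng))
```

Because the keys are independent and identically distributed, the merged top `size` is a uniform sample of the whole data set, so on average about 90% of the sampled documents in this example come from the larger shard.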

We should add an option to the $sample stage to force it to perform the scan + random sort approach. When this option is passed, it should probably use the better random number generator as well.

Assignee: [DO NOT USE] Backlog - Query Optimization
Reporter: Charlie Swanson
Participants: [DO NOT USE] Backlog - Query Optimization, Charlie Swanson
Votes: 1
Watchers: 9

Created: Jan 05 2016 07:37:06 PM UTC
Updated: Dec 06 2022 04:36:44 AM UTC
