Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.6.11, 3.0.6, 3.2.0
Component/s: Querying
Labels: bonsai
Assigned Teams: Query Optimization
Backwards Compatibility: Fully Compatible
Case:
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The query engine currently selects a winning plan among n candidate plans by the following mechanism. We start executing each of the n plans for some short trial period. Once the trial period ends, we use stats collected from the brief trial execution to score the plans. The most highly scored plan becomes the winner. The scoring is based on a notion of "productivity", essentially the number of results produced per unit of work (where work is an artificial unit that is a proxy for CPU + IO time).
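
As a rough illustration of the ranking step described above, here is a minimal Python sketch. It is not the server's actual scoring code: the names TrialStats, productivity, and pick_winner are hypothetical, and the score is reduced to plain results per unit of work with no other terms.

```python
from dataclasses import dataclass


@dataclass
class TrialStats:
    """Counters collected for one candidate plan during its trial period."""
    name: str
    results: int  # documents produced during the trial
    works: int    # abstract work units spent (proxy for CPU + IO time)


def productivity(stats: TrialStats) -> float:
    """Results produced per unit of work; 0.0 if no work was done."""
    return stats.results / stats.works if stats.works else 0.0


def pick_winner(candidates: list[TrialStats]) -> TrialStats:
    """Return the candidate with the highest productivity score."""
    return max(candidates, key=productivity)


# Made-up trial numbers for two candidate plans.
trial = [
    TrialStats("p1", results=3, works=100),
    TrialStats("p2", results=40, works=100),
]
winner = pick_winner(trial)
print(winner.name, productivity(winner))  # prints: p2 0.4
```

Note that the score depends only on what each plan happened to see during its trial, which is where the problem described next comes from.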

You can think of the trial execution period as a way to sample the data on the fly, a kind of adaptive query processing technique in which inferences about the data are made while the query is executing. The problem is that this sampling strategy may not accurately represent the true data distribution.

For example, suppose that some candidate plan p1 is scanning the index {a: 1} over the interval [0, 100000] in order to answer the query predicate {a: {$gte: 0, $lte: 100000}, b: "selectiveString"}. There is an alternative plan p2 which simply looks up all docs where "b" is equal to some rare string, "selectiveString", in the {b: 1} index. Clearly p2 is the better plan: instead of scanning some large index range and filtering out all the documents without the correct value for "b", we do a lookup based on the more selective predicate.

However, depending on the data, the plan ranking algorithm described above might lead us to choose the suboptimal plan p1 instead of p2. Suppose that there is a correlation between the "a" and "b" fields so that all of the documents where "b" is "selectiveString" also have an "a" value of 0. Since p1 is scanning index {a: 1} in order, we encounter all the documents where "a" is 0 and "b" is "selectiveString" first during the trial execution period. This can spuriously make p1 look as good as p2! If instead we were to sample randomly from the data where "a" is on the interval [0, 100000], the query engine would quickly observe that plan p1 is much slower than p2. Real data is often correlated, so despite the fairly contrived example, this can be a problem in practice.
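
The effect can be reproduced with a small Python simulation. This is a sketch under assumptions that are not in the ticket: the data set is made up (50 matching documents with "a" equal to 0, plus 100,000 non-matching documents with "a" from 1 to 100000), each examined index entry costs one work unit, and the trial budget is 50 work units. The helper run_plan is hypothetical and does not model the real executor.

```python
import random

SELECTIVE = "selectiveString"

# Hypothetical data set: every matching document has a == 0.
docs = [{"a": 0, "b": SELECTIVE} for _ in range(50)]
docs += [{"a": a, "b": "other"} for a in range(1, 100_001)]


def matches(doc):
    """Query predicate {a: {$gte: 0, $lte: 100000}, b: "selectiveString"}."""
    return 0 <= doc["a"] <= 100_000 and doc["b"] == SELECTIVE


def run_plan(entries, budget=None):
    """Examine entries in order, one work unit each, until the budget is
    spent or the scan finishes. Returns (results, works)."""
    results = works = 0
    for doc in entries:
        if budget is not None and works >= budget:
            break
        works += 1
        results += matches(doc)
    return results, works


# p1 walks the {a: 1} index in key order; p2 visits only the entries of
# the {b: 1} index where b == SELECTIVE.
p1_entries = sorted(docs, key=lambda d: d["a"])
p2_entries = [d for d in docs if d["b"] == SELECTIVE]

TRIAL_BUDGET = 50
p1_trial = run_plan(p1_entries, TRIAL_BUDGET)
p2_trial = run_plan(p2_entries, TRIAL_BUDGET)
p1_full = run_plan(p1_entries)
p2_full = run_plan(p2_entries)

# Alternative: give p1 a uniform random sample of its index range instead
# of the first entries in key order.
rng = random.Random(0)
p1_sampled = run_plan(rng.sample(p1_entries, TRIAL_BUDGET))

print("trial  p1:", p1_trial, "p2:", p2_trial)
print("full   p1:", p1_full, "p2:", p2_full)
print("sample p1:", p1_sampled)
```

In this setup both plans return 50 results for 50 work units during the trial, so they tie on productivity, while running to completion costs p1 100,050 work units against 50 for p2. The random sample is far less likely to be dominated by the matching documents, so it gives a much lower and more representative productivity for p1.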

is duplicated by

SERVER-21178 Super slow query and increased memory usage on inefficient range queries (Closed)

SERVER-82548 MongoDB 7.0.2 SBE selects a different index when doing $in with a large array (Closed)

is related to

SERVER-20619 Statistics-based query optimization (Backlog)

related to

SERVER-13211 Optimal index not chosen for query plan when many indexes match same prefix (Closed)

SERVER-46904 Efficiency of one $or branch during planning may not be representative of overall plan efficiency (Backlog)

Assignee: [DO NOT USE] Backlog - Query Optimization
Reporter: David Storch
Participants: [DO NOT USE] Backlog - Query Optimization, David Storch, Hoyt Ren, Xiaochen Wu
Votes: 12
Watchers: 60

Created: Sep 24 2015 05:38:10 PM UTC
Updated: Jul 30 2025 03:12:20 PM UTC