[SERVER-45364] Query Planner should estimate cost of each predicate Created: 06/Jan/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: William Byrne III Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File elemTest.js     File elemTest.out.compact    
Issue Links:
Duplicate
is duplicated by SERVER-52619 Performance Regression on 4.4 compare... Closed
is duplicated by SERVER-45308 Alphabetical order of field names use... Closed
Related
is related to SERVER-37530 Provide a way to cause a well-defined... Backlog
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

This can be used to choose a more efficient order of execution for predicates like $or and $elemMatch.

Original Title

The Query Parser reorders $elemMatch ahead of equality conditions when evaluating a $or set.

Original Description

I created a collection with 200 documents, all in this format:

  { _id: ObjectId,
    aphaField: true,
    giantArray: [array of 20,000 documents like {ae: intValue}]
    zuluField: true }

The values for giantArray.ae are random values, but the first 19,999 are all > 17.
The last one has value 11.

Then I ran find(qry).count() against the collection, where qry was each of these filters:
1. {aphaField: {$eq: true}}
2. {giantArray: {$elemMatch: {ae: {$eq: 11}}}}
3. {zuluField: {$eq: true}}

I ran each query 21 times. The first time was to warm the cache, and then I recorded how long the next 20 executions took. All queries used a COLLSCAN as the collection has no indexes beyond the default one on _id, and all scanned all 100 documents.

The equality matches on the boolean fields alone ({aphaField: {$eq: true}} and {zuluField: {$eq: true}}) executed 20 times in 8 milliseconds. The $elemMatch condition on the nested array field "giantArray.ae" alone was slower, with 20 executions taking 4,616 milliseconds.

This is not unexpected, as the singleton boolean fields are very fast to test, whereas all 20,000 array members had to be examined to find the one with value 11.

<br>

Next I joined the equality tests on the boolean fields and the $elemMatch condition within a $or, both as {$or: [X, Y]} and {$or: [Y, X]}, and ran them 21 times each too:

  • {$or: [\{aphaField: {$eq: true}}, \{zuluField: {$eq: true}}]}
  • {$or: [\{aphaField: {$eq: true}}, \{giantArray: {$elemMatch: {ae: {$eq: 11}}}}]}
  • {$or: [\{giantArray: {$elemMatch: {ae: {$eq: 11}}}}, \{aphaField: {$eq: true}}]}
  • {$or: [\{giantArray: {$elemMatch: {ae: {$eq: 11}}}}, \{zuluField: {$eq: true}}]}
  • {$or: [/\{zuluField: /{$eq: true/}/}, /\{aphaField: /{$eq: true/}/}]/}
  • {$or: [\{zuluField: {$eq: true}}, \{giantArray: {$elemMatch: {ae: {$eq: 11}}}}]}

When X and Y were simple equality tests on boolean fields (i.e. no $elemMatch), both {$or: [X, Y]} and {$or: [Y, X]} executed 20 times in 8 to 10 milliseconds. However, when either X or Y was the $elemMatch condition, the Query Planner parser reordered THAT condition to be FIRST in the evaluation of the $or. That resulted in 20 executions of any query involving $elemMatch taking between 4568 and 4915 milliseconds.

SERVER-45308) shows that the Query Planner parser reorders conditions alphabetically, but $elemMatch is clearly a special case, perhaps from a different phase in the parsing. The effect of putting the $elemMatch conditions first was poorer performance because conditions are evaluated in that order too.

Evaluating a condition against an array field will typically take longer than one on a simple field, because the array will generally have multiple values to be tested. If the condition on the boolean field had been evaluated first these queries would have completed in much less time.

Conclusions
===========
In SERVER-45308) my conclusion was that no backend code change could help, as due to MongoDB's flexible schema the Query Parser cannot tell that {{

{"giantArray.ae": 11}

}} is a condition on an array field and not just a nested sub-field.

In contrast, when the query filter employs $elemMatch it should be obvious to the Query Parser that the field involved is an array of documents (or the end user would not have used that operator).

I suggest that the Query Parser be changed to order $elemMatch conditions LAST within an $or set. Overall I believe this will result in better performance.



 Comments   
Comment by Charlie Swanson [ 07/Jan/20 ]

Thanks for the request william.byrne. I've generalized the title a bit since the query team believes this is more generally applicable.

charlie.swanson reminder to search for whether this request already exists.

Comment by William Byrne III [ 06/Jan/20 ]

The attached files are a reproduction script and it's output (reformatted to be more compact).

Generated at Thu Feb 08 05:08:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.