[SERVER-45308] Alphabetical order of field names used in an $or clause drives evaluation order and thus affects performance Created: 27/Dec/19  Updated: 07/Jan/20  Resolved: 07/Jan/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: William Byrne III Assignee: Charlie Swanson
Resolution: Duplicate Votes: 0
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File orTest.js     File orTest.out    
Issue Links:
Duplicate
duplicates SERVER-45364 Query Planner should estimate cost of... Backlog
Related
related to SERVER-37530 Provide a way to cause a well-defined... Backlog
Sprint: Query 2020-01-13
Participants:
Case:

 Description   

Synopsis:

Because the Query Parser sorts the conditions in a $or clause and then evaluates them in that order, changing the name of the field can affect performance.

Detailed explanation:

I created a collection with 200 documents, all in this format:

  { _id: ObjectId,
    firstField: true,
    giantArray: [array of 20,000 members like {ae: intValue}]
    lastField: true }

The values for giantArray.ae are random values, but the first 19,999 are all > 17.
The last one has value 11.

Single Condition Tests
================
I ran find(qry).count() against the collection, where qry was each of these filters:
1. {firstField: true}
2. {"giantArray.ae": 11}
3. {lastField: true}

I ran each query 21 times. The first time was to warm the cache, and then I recorded how long the next 20 executions took. All queries used a COLLSCAN as the collection has no indexes beyond the default one on _id, and all scanned all 100 documents.

Single Test Results:
==============

  • The equality matches on each of the boolean fields (firstField and lastField) alone both executed 20 times in 8 to 18 milliseconds.
  • The equality match against the nested array field "giantArray.ae" alone was much slower, with 20 executions taking 12,332 milliseconds.

This is not unexpected, as the singleton boolean fields were very fast to test, whereas all 20,000 array members had to be examined to find the one with value 11.

$or Tests
=======
Next I combined the equality tests on the boolean and array fields into $or joined pairs, in both orders (i.e. {$or: [X, Y]} and {$or: [Y, X]}:

 - {$or: [{firstField: true}, {lastField: true}]}
 - {$or: [{firstField: true}, {"giantArray.ae": 11}]}
 
 - {$or: [{"giantArray.ae": 11}, {firstField: true}]}
 - {$or: [{"giantArray.ae": 11}, {lastField: true}]}
 
 - {$or: [{lastField: true}, {firstField: true}]}
 - {$or: [{lastField: true}, {"giantArray.ae": 11}]}

$or Joined Tests Results:
==================

  • Both {$or: [X, Y]} and {$or: [Y, X]} gave roughly the same performance. This is because the Query Planner sorts the members of the $or set when it parses the query. You can see this by comparing the input query with the explain().queryPlanner.parsedQuery values: both {$or: [X, Y]} and {$or: [Y, X]} are parsed to {$or: [X, Y]}.

Here the significance of the field names becomes apparent.

  • If {firstField: true} was one of the conditions in the $or, the parser sorted it to be first, and the query executed 20 times in 10 to 24 milliseconds. This is similar to the time to execute queries with {firstField: true}

    by itself.

  • In contrast {lastField: true} paired with {"giantArray.ae": 11} (in either order) took over 1200 milliseconds, because the parser sorts the condition on the array field to be first. This is similar to the time to to execute queries with {"giantArray.ae": 11} by itself.

Conclusion:
===========
If the Query Planner did not sort the members of the $or set when it parsed the query, the order of the conditions would affect the order of execution. However, I believe the sort is required so that functionally identical queries can be matched to the same cached plan.

I do not think there is a backend code change to help with this behaviour, as due to MongoDB's flexible schema the Query Parser cannot not know which fields are arrays before examining them. This SERVER ticket might only serve to socument this behaviour, and to pointy out the workaround.

Workaround:
=========

  • Give array fields names that start late in the alphabet, so that conditions on them get sorted after conditions on simpler fields.


 Comments   
Comment by Charlie Swanson [ 07/Jan/20 ]

And to further elaborate one point: william.byrne is correct that the query planner sorts the predicates for caching purposes, but one could imagine the planner distinguishing between the "canonical" or caching order of the predicates from the actual execution order. This would certainly not be easy to do, but I think other systems have such analysis so it's a reasonable request.

Comment by Charlie Swanson [ 07/Jan/20 ]

Ok I'm going to close this as a duplicate of SERVER-45364. I think that's the closest in similarity. As bduncan@visualmining.com points out, there are possible changes the server could take to pick a better/more efficient ordering.

This is definitely related to SERVER-37530 as a workaround if the user knows a better order, so I will also link that.

Comment by Charlie Swanson [ 07/Jan/20 ]

charlie.swanson to read up and make the appropriate number of duplicates.

Comment by Bruce Duncan [ 29/Dec/19 ]

This issue is based on one of my support cases. I believe this issue can be considered a duplicate of SERVER-37530, just as SERVER-45123 was. SERVER-37530 talks about short-circuit evaluation in `$and` and relates it to expressions that produce errors, whereas this issue talks about short-circuit evaluation in `$or` and relates it to expressions that have a negative impact on query performance.

Though william.byrne says

I do not think there is a backend code change to help with this behaviour

I would disagree. You are presumably sorting the elements in order to improve the chances of matching a cached query plan where the sorted query itself is the cache key, and then using the sorted query elements in the actual plan. There are two possible approaches I can think of here:

1. Leave the sorting in place, but use ascending operation cost order instead of ascending alphabetical order. william.byrne points out that you do not know if a field is an array or not ahead of time. To me, that is another way of saying that you don't know the cost of the operation for a given field. Mongo could certainly keep a cache of discovered schema information obtained while executing queries (`x` is an array with sizes min-max, `y` is a scalar`, etc), and use it to derive a cost for non-index covered field operations that could then be used for the sorting.

2. Leave the sorting in place, but only for objects, not an arrays (i.e. the values of the `$and` and `$or` operators). Yes, this will increase the number of queries that need to have query plans calculated if the queries contain operators with array values that have varying element orders. However, for customers who can ensure a small cardinality of operator array element orders (such as those generating their queries programmatically) this suggested change would not have a significant negative impact on performance. Others would encounter a performance hit the first time a new formulation of a query is executed, but not on subsequent executions. This is not dissimilar to the hit people already deal with related to warming the cache.

Generated at Thu Feb 08 05:08:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.