Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-45364

Query Planner should estimate cost of each predicate

    • Type: Icon: Improvement Improvement
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Querying
    • Labels:
    • Query Optimization

      This can be used to choose a more efficient order of execution for predicates like $or and $elemMatch.

      Original Title

      The Query Parser reorders $elemMatch ahead of equality conditions when evaluating a $or set.

      Original Description

      I created a collection with 200 documents, all in this format:

        { _id: ObjectId,
          aphaField: true,
          giantArray: [array of 20,000 documents like {ae: intValue}]
          zuluField: true }
      

      The values for giantArray.ae are random values, but the first 19,999 are all > 17.
      The last one has value 11.

      Then I ran find(qry).count() against the collection, where qry was each of these filters:
      1. {aphaField: {$eq: true}}
      2. {giantArray: {$elemMatch: {ae: {$eq: 11}}}}
      3. {zuluField: {$eq: true}}

      I ran each query 21 times. The first time was to warm the cache, and then I recorded how long the next 20 executions took. All queries used a COLLSCAN as the collection has no indexes beyond the default one on _id, and all scanned all 100 documents.

      The equality matches on the boolean fields alone ({aphaField: {$eq: true}} and {zuluField: {$eq: true}}) executed 20 times in 8 milliseconds. The $elemMatch condition on the nested array field "giantArray.ae" alone was slower, with 20 executions taking 4,616 milliseconds.

      This is not unexpected, as the singleton boolean fields are very fast to test, whereas all 20,000 array members had to be examined to find the one with value 11.

      <br>

      Next I joined the equality tests on the boolean fields and the $elemMatch condition within a $or, both as {$or: [X, Y]} and {$or: [Y, X]}, and ran them 21 times each too:

      • {$or: [\{aphaField: {$eq: true}}, \{zuluField: {$eq: true}}]}
      • {$or: [\{aphaField: {$eq: true}}, \{giantArray: {$elemMatch: {ae: {$eq: 11}}}}]}
      • {$or: [\{giantArray: {$elemMatch: {ae: {$eq: 11}}}}, \{aphaField: {$eq: true}}]}
      • {$or: [\{giantArray: {$elemMatch: {ae: {$eq: 11}}}}, \{zuluField: {$eq: true}}]}
      • {$or: [/\{zuluField: /{$eq: true/}/}, /\{aphaField: /{$eq: true/}/}]/}
      • {$or: [\{zuluField: {$eq: true}}, \{giantArray: {$elemMatch: {ae: {$eq: 11}}}}]}

      When X and Y were simple equality tests on boolean fields (i.e. no $elemMatch), both {$or: [X, Y]} and {$or: [Y, X]} executed 20 times in 8 to 10 milliseconds. However, when either X or Y was the $elemMatch condition, the Query Planner parser reordered THAT condition to be FIRST in the evaluation of the $or. That resulted in 20 executions of any query involving $elemMatch taking between 4568 and 4915 milliseconds.

      SERVER-45308) shows that the Query Planner parser reorders conditions alphabetically, but $elemMatch is clearly a special case, perhaps from a different phase in the parsing. The effect of putting the $elemMatch conditions first was poorer performance because conditions are evaluated in that order too.

      Evaluating a condition against an array field will typically take longer than one on a simple field, because the array will generally have multiple values to be tested. If the condition on the boolean field had been evaluated first these queries would have completed in much less time.

      Conclusions
      ===========
      In SERVER-45308) my conclusion was that no backend code change could help, as due to MongoDB's flexible schema the Query Parser cannot tell that {{

      {"giantArray.ae": 11}

      }} is a condition on an array field and not just a nested sub-field.

      In contrast, when the query filter employs $elemMatch it should be obvious to the Query Parser that the field involved is an array of documents (or the end user would not have used that operator).

      I suggest that the Query Parser be changed to order $elemMatch conditions LAST within an $or set. Overall I believe this will result in better performance.

        1. elemTest.js
          3 kB
          William Byrne III
        2. elemTest.out.compact
          2 kB
          William Byrne III

            Assignee:
            backlog-query-optimization [DO NOT USE] Backlog - Query Optimization
            Reporter:
            william.byrne@mongodb.com William Byrne III
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

              Created:
              Updated: