Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-15086

Allow for efficient range queries over non-array fields in multikey indices

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.4.11, 2.6.4
    • Fix Version/s: 3.3.8
    • Component/s: Performance, Querying
    • Labels:
      None
    • Backwards Compatibility:
      Major Change
    • Sprint:
      Query 16 (06/24/16)
    • Case:

      Description

      Issue Status as of Jun 06, 2016

      ISSUE SUMMARY
      In versions of MongoDB 3.2 and earlier, the metadata associated with an index contains a single bit to indicate whether or not the index is multikey. There is no granular information to indicate which indexed fields cause the index to be multikey. This has implications for compound multikey indexes because it prevents the query planner from using the index as efficiently as theoretically possible (i.e. given perfect knowledge about the structure of the documents in the index).

      Support for tracking which indexed fields cause the index to be multikey (aka "path-level multikey tracking") has been implemented in MongoDB 3.3.8. The query planner then uses this information to get tighter bounds on queries by assigning additional predicates to the index.

      SUPPORTED STORAGE ENGINES
      This feature is supported for the WiredTiger and InMemory storage engines only.

      This feature is not supported for the MMAPv1 storage engine. MMAPv1 support for path-level multikey information is tracked in SERVER-22727.

      IMPACT ON DOWNGRADE
      It is possible to downgrade from MongoDB 3.3.8 to MongoDB 3.2 only without first dropping all of the indexes that were created on MongoDB 3.3.8. However, it is only possible downgrade to MongoDB 3.2.7 or newer. Downgrading to MongoDB 3.2.7 will automatically delete the path-level multikey information from the index metadata. The indexes will retain only the index-level flag that indicates whether the index is multikey.

      Downgrading from MongoDB 3.3.8 to a version of MongoDB older than 3.2.7 may produce the following assertion message if indexes contain path-level multikey information:

      2016-05-31T13:18:11.090-0400 I -        [initandlisten] Assertion: 13111:wrong type for field (ns) 10 != 2
      

      See SERVER-23117 for more information on this downgrade requirement.

      TECHNICAL DETAILS

      Tighter bounds in 3.3.8+

      Using the path-level multikey information we track for an index, we can search a bounded interval (thus having intersected the bounds) on a field in the compound multikey index that doesn't ever refer to an array value contain multiple elements. In the following example, the "a" field is a scalar and the "b" field is an array value containing multiple elements.

      Index: {a: 1, b: 1}
      Document: {a: 5, b: [1, 2, 3]}
      ⇒ Multikey paths: ["b"]
      Query: {a: {$gte: 0, $lt: 10}}
      ⇒ Bounds on "a": [0, 10)
      ⇒ Bounds on "b": [MinKey, MaxKey]
      

      SERVER-22401 has additional examples of what bounds can be generated for compound multikey indexes given a document structure and query.

      Contrasted with 3.2 and earlier

      In MongoDB 3.2 and earlier, the query planner doesn't have enough information to know that the "a" field is never an array value in some document in the collection (e.g. that the document {a: [1, 2, 3], b: 5} doesn't also exist in the collection). Without this additional information, the query planner must assume that the {a: {$gte: 0}} condition could apply to a different element than the {a: {$lt: 10}} condition. Thus, we can only search a left-bounded or right-bounded interval on the "a" field and apply the other condition as a subsequent filter:

      {
          "stage" : "FETCH",
          "filter" : {
              "a" : {
                  "$gte" : 0
              }
          },
          "inputStage" : {
              "stage" : "IXSCAN",
              "keyPattern" : {
                  "a" : 1,
                  "b" : 1
              },
              "indexName" : "a_1_b_1",
              "isMultiKey" : true,
              ...
              "indexBounds" : {
                  "a" : [
                      "[-inf.0, 10.0)"
                  ],
                  "b" : [
                      "[MinKey, MaxKey]"
                  ]
              }
          }
      }
      

      Explain output in 3.3.8+

      The explain output for the COUNT_SCAN, DISTINCT_SCAN, and IXSCAN stages has been updated to include an additional "multiKeyPath" field for indexes that support this feature. The "multiKeyPath" field describes what prefixes of the indexed fields cause the index to be multikey:

      {
          "stage" : "IXSCAN",
          "keyPattern" : {
              "a" : 1,
              "b" : 1
          },
          "indexName" : "a_1_b_1",
          "isMultiKey" : true,
          "multiKeyPaths" : {
              "a" : [ ],
              "b" : [
                  "b"
              ]
          },
          ...
      }
      

      SERVER-23115 has additional information regarding the "mulitKeyPaths" field in the explain output.

      Updated description

      Suppose you have documents of the form

      {a: [{b: 1}, {b: 2}, {b: 3}]}
      {a: [{b: 4}, {b: 5}, {b: 6}]}
      

      with the index {"a.b": 1}. With this schema, there is currently no way to constrain the index bounds with multiple predicates. Users who need to do this generally must change their schema. Instead we could provide new query language semantics to support this use case. See the comments below or related tickets SERVER-8610, SERVER-7959, and SERVER-6050 for more details.

      Original Description

      It looks like there is unexpected slow performance on a range query when using $elemMatch. I first encountered it when using the aggregation framework and using the $match pipeline with $elemMatch, and have now backed it out into an example with $find.

      The performance hit is quite large, 20x slower or so, when I believe, if anything, the query should actually run FASTER, not 20x slower.

      I could try and troubleshoot more, but I don't totally understand the explain() output when doing ranges of dates. Sometimes explain() puts the indexBounds as 'true' in the explain(), and sometimes explain() put the indexBounds as 'ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")', and I"m not 100% sure how to interpret that, other than I believe the equivalent is "no bound" (meaning the query is not bounding itself at the top or bottom of the range properly).

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                16 Vote for this issue
                Watchers:
                53 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: