Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-56093

A distinct query on a nested field becomes very slow when adding a multikey index covering the query

    • Type: Icon: Bug Bug
    • Resolution: Incomplete
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 4.2.8
    • Component/s: None
    • Labels:
      None
    • Environment:
      MongoDB v4.2.8
    • ALL
    • Hide

      I am not sure what would be the minimum scenario to reproduce, but it seems like

      • an event model of some properties like user_id, and name, time, and a nested property nested.x
      • A few million documents in the db
      • a compound index including those fields and the nested field
      • a distinct query on that nested field nested.x (along with some filters on other fields) covered by the above index

      Could reproduce the problem 

      Show
      I am not sure what would be the minimum scenario to reproduce, but it seems like an event model of some properties like user_id, and name, time, and a nested property nested.x A few million documents in the db a compound index including those fields and the nested field a distinct query on that nested field nested.x (along with some filters on other fields) covered by the above index Could reproduce the problem 

      We ran into an issue we can't explain, when adding a covering index on a distinct query caused huge performance issues, and we observed the CPU of our instances going to 100%. Amongst other metrics, we weren't able to find anything that seemed to be correlated to the problem (swap, page faults, etc seemed as usual, but maybe we weren't looking at the right things) 

      Here is a sample event document. Note the nested properties.id field that we are performing a distinct() on

      Unable to find source-code formatter for language: mongo. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
      {
        "_id"=>ObjectId('6076c00c63aa1f0968fa5240'),
        "name"=>"Viewed profile",
        "properties"=>{
          "id"=>"5d31acbec00a89296d164e90", 
          "navigation_referrer"=>"WISHLIST", 
          "couple_id"=>"{hex}"
        },
       "time"=>"2021-04-14T10:12:27.752Z",
       "ip"=>"195.83.xxx.xx",
       "referrer"=>"{string}",
       "visit_id"=>ObjectId('6076bef8ffdc312fe32aa6d2')
       "user_id"=>ObjectId('5899d66584b2523570ec4993')
       )
      }

      of a collection of about 20 million documents

      We perform the following query

      Unable to find source-code formatter for language: mongo. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
      db.ahoy_events.explain().distinct("properties.id", {name: "Viewed profile", user_id: ObjectId("5899d66584b2523570ec4993")})
      

      Previously, we had no covering index on this type of query, and the selected plan was

      Unable to find source-code formatter for language: mongo. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
      "winningPlan" : {
        "stage" : "FETCH",
        "inputStage" : {
          "stage" : "IXSCAN",
          "keyPattern" : {
            "name" : 1,
            "user_id" : 1,
            "time" : -1
          },
          "indexName" : "Event name with user ID by time",
          "isMultiKey" : false,
          "multiKeyPaths" : {
            "name" : [ ],
            "user_id" : [ ],
            "time" : [ ]
          },
          "isUnique" : false,
          "isSparse" : false,
          "isPartial" : false,
          "indexVersion" : 2,
          "direction" : "forward",
          "indexBounds" : {
            "name" : [
              "[\"Viewed profile\", \"Viewed profile\"]"
            ],
            "user_id" : [
              "["ObjectId('5899d66584b2523570ec4993')", "ObjectId('5899d66584b2523570ec4993')"]"
            ],
            "time" : [
              "["MaxKey", "MinKey"]"
            ]
          }
        }
      },

      Using an index defined for other purposes (1)

      { "name" : 1, "user_id" : 1, "time" : -1 } # (1)  
      

      But the request was still fast to the extent or retrieving documents in <1sec

      After creating the following index (2)

      { "name" : 1, "properties.id" : 1, "time" : -1, "user_id" : 1 }   # (2)
      

      the winning plan becomes

      Unable to find source-code formatter for language: mongo. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
      "winningPlan" : {
        "stage" : "PROJECTION_DEFAULT",
        "transformBy" : {
          "_id" : 0,
          "properties.id" : 1
        },
        "inputStage" : {
          "stage" : "DISTINCT_SCAN",
          "keyPattern" : {
            "name" : 1,
            "properties.id" : 1,
            "time" : -1,
            "user_id" : 1
          },
          "indexName" : "user_id by name properties.id and time",
          "isMultiKey" : false,
          "multiKeyPaths" : {
            "name" : [ ],
            "properties.id" : [ ],
            "time" : [ ],
            "user_id" : [ ]
          },
          "isUnique" : false,
          "isSparse" : false,
          "isPartial" : true,
          "indexVersion" : 2,
          "direction" : "forward",
          "indexBounds" : {
            "name" : [
              "[\"Viewed profile\", \"Viewed profile\"]"
            ],
            "properties.id" : [
              "["MinKey", "MaxKey"]"
            ],
            "time" : [
              "["MaxKey", "MinKey"]"
            ],
            "user_id" : [
              "["ObjectId('5899d66584b2523570ec4993')", "ObjectId('5899d66584b2523570ec4993')"]"
            ]
          }
        }
      

      And the request becomes very slow as shown in our graphs (the query can take up to several minutes when the server is loaded, and a dozen seconds when it is the only query)

      Furthermore, we tried creating another index that matched exactly the query (3)

      { "name" : 1, "user_id" : 1, "properties.id" : 1 }  # (3)
      

      The result was that this index was still not selected as a winning plan, and mongoDB kept using the previous index with the extra "time: -1" in the multikey definition (index (2)).

      After removing the index (2)

      { "name" : 1, "properties.id" : 1, "time" : -1, "user_id" : 1 }  
      

      our new index (3) is selected with a winning plan involving a DISTINCT_SCAN and works fast as expected

      So several questions

      • What could explain that the new index (2) was so slow despite covering the query ? Is it a bug ?
      • What could explain why a more appropriate / shorter index (3) was not selected in favor of (2)

            Assignee:
            edwin.zhou@mongodb.com Edwin Zhou
            Reporter:
            CyrilDD Cyril
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: