[SERVER-56093] A distinct query on a nested field becomes very slow when adding a multikey index covering the query Created: 14/Apr/21  Updated: 17/May/21  Resolved: 17/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cyril Assignee: Edwin Zhou
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB v4.2.8


Attachments: PNG File Screenshot 2021-04-14 at 12.30.56.png    
Operating System: ALL
Steps To Reproduce:

I am not sure what would be the minimum scenario to reproduce, but it seems like

  • an event model of some properties like user_id, and name, time, and a nested property nested.x
  • A few million documents in the db
  • a compound index including those fields and the nested field
  • a distinct query on that nested field nested.x (along with some filters on other fields) covered by the above index

Could reproduce the problem 

Participants:

 Description   

We ran into an issue we can't explain, when adding a covering index on a distinct query caused huge performance issues, and we observed the CPU of our instances going to 100%. Amongst other metrics, we weren't able to find anything that seemed to be correlated to the problem (swap, page faults, etc seemed as usual, but maybe we weren't looking at the right things) 

Here is a sample event document. Note the nested properties.id field that we are performing a distinct() on

{
  "_id"=>ObjectId('6076c00c63aa1f0968fa5240'),
  "name"=>"Viewed profile",
  "properties"=>{
    "id"=>"5d31acbec00a89296d164e90", 
    "navigation_referrer"=>"WISHLIST", 
    "couple_id"=>"{hex}"
  },
 "time"=>"2021-04-14T10:12:27.752Z",
 "ip"=>"195.83.xxx.xx",
 "referrer"=>"{string}",
 "visit_id"=>ObjectId('6076bef8ffdc312fe32aa6d2')
 "user_id"=>ObjectId('5899d66584b2523570ec4993')
 )
}

of a collection of about 20 million documents

We perform the following query

db.ahoy_events.explain().distinct("properties.id", {name: "Viewed profile", user_id: ObjectId("5899d66584b2523570ec4993")})

Previously, we had no covering index on this type of query, and the selected plan was

"winningPlan" : {
  "stage" : "FETCH",
  "inputStage" : {
    "stage" : "IXSCAN",
    "keyPattern" : {
      "name" : 1,
      "user_id" : 1,
      "time" : -1
    },
    "indexName" : "Event name with user ID by time",
    "isMultiKey" : false,
    "multiKeyPaths" : {
      "name" : [ ],
      "user_id" : [ ],
      "time" : [ ]
    },
    "isUnique" : false,
    "isSparse" : false,
    "isPartial" : false,
    "indexVersion" : 2,
    "direction" : "forward",
    "indexBounds" : {
      "name" : [
        "[\"Viewed profile\", \"Viewed profile\"]"
      ],
      "user_id" : [
        "["ObjectId('5899d66584b2523570ec4993')", "ObjectId('5899d66584b2523570ec4993')"]"
      ],
      "time" : [
        "["MaxKey", "MinKey"]"
      ]
    }
  }
},

Using an index defined for other purposes (1)

{ "name" : 1, "user_id" : 1, "time" : -1 } # (1)  

But the request was still fast to the extent or retrieving documents in <1sec

After creating the following index (2)

{ "name" : 1, "properties.id" : 1, "time" : -1, "user_id" : 1 }   # (2)

the winning plan becomes

"winningPlan" : {
  "stage" : "PROJECTION_DEFAULT",
  "transformBy" : {
    "_id" : 0,
    "properties.id" : 1
  },
  "inputStage" : {
    "stage" : "DISTINCT_SCAN",
    "keyPattern" : {
      "name" : 1,
      "properties.id" : 1,
      "time" : -1,
      "user_id" : 1
    },
    "indexName" : "user_id by name properties.id and time",
    "isMultiKey" : false,
    "multiKeyPaths" : {
      "name" : [ ],
      "properties.id" : [ ],
      "time" : [ ],
      "user_id" : [ ]
    },
    "isUnique" : false,
    "isSparse" : false,
    "isPartial" : true,
    "indexVersion" : 2,
    "direction" : "forward",
    "indexBounds" : {
      "name" : [
        "[\"Viewed profile\", \"Viewed profile\"]"
      ],
      "properties.id" : [
        "["MinKey", "MaxKey"]"
      ],
      "time" : [
        "["MaxKey", "MinKey"]"
      ],
      "user_id" : [
        "["ObjectId('5899d66584b2523570ec4993')", "ObjectId('5899d66584b2523570ec4993')"]"
      ]
    }
  }

And the request becomes very slow as shown in our graphs (the query can take up to several minutes when the server is loaded, and a dozen seconds when it is the only query)

Furthermore, we tried creating another index that matched exactly the query (3)

{ "name" : 1, "user_id" : 1, "properties.id" : 1 }  # (3)

The result was that this index was still not selected as a winning plan, and mongoDB kept using the previous index with the extra "time: -1" in the multikey definition (index (2)).

After removing the index (2)

{ "name" : 1, "properties.id" : 1, "time" : -1, "user_id" : 1 }  

our new index (3) is selected with a winning plan involving a DISTINCT_SCAN and works fast as expected

So several questions

  • What could explain that the new index (2) was so slow despite covering the query ? Is it a bug ?
  • What could explain why a more appropriate / shorter index (3) was not selected in favor of (2)


 Comments   
Comment by Edwin Zhou [ 17/May/21 ]

Hi CyrilDD,

We haven’t heard back from you for some time, so I’m going to close this ticket. Please let us know if this is still an issue for you, or if you have any remaining questions and I can reopen this ticket!

Best,
Edwin

Comment by Edwin Zhou [ 07/May/21 ]

Hi CyrilDD,

We still need additional information to diagnose the problem. If this is still an issue for you, please let us know if my previous comment clarifies your reported behavior!

Thanks,
Edwin

Comment by Edwin Zhou [ 21/Apr/21 ]

Hi CyrilDD,

Thanks for your detailed report. I'd first like to clarify that a multikey index is not in use here because none of the fields in your example document is an array. You may be referring to indexing on an embedded field.

It appears that you're attempting to use a compound index (2) that is incompatible with the query field in your distinct command. Note that "properties.id" is the field for which distinct values are returned, but is not a part of the query. The query in question:

{name: "Viewed profile", user_id: ObjectId("5899d66584b2523570ec4993")}

can use compound indexes if the index is prefixed with name and user_id respectively. Both

{ "name" : 1, "user_id" : 1, "time" : -1 } # (1)  
{ "name" : 1, "user_id" : 1, "properties.id" : 1 }  # (3)

are compatible since they begin with {"name": 1, "user_id": 1}.

This index:

{ "name" : 1, "properties.id" : 1, "time" : -1, "user_id" : 1 }   # (2)

is incompatible because the prefix doesn't match the query field. You can read more about compound query prefixes in our documentation.

Does this clarify your reported behavior?

Best,
Edwin

Generated at Thu Feb 08 05:38:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.