[SERVER-48987] MongoDB is choosing the wrong index / execution plan. Created: 19/Jun/20  Updated: 16/Feb/23  Resolved: 16/Feb/23

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anand Singh Assignee: Backlog - Query Optimization
Resolution: Cannot Reproduce Votes: 0
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Operating System: ALL
Sprint: Query 2020-07-27, Query 2020-08-10, Query 2020-08-24, Query 2020-09-07
Participants:

 Description   

Under load, we can see a lot of slow queries in the MongoDB log file. Most of them are not using the best index, and consequently the best execution plan is not chosen. However, when I run the same queries myself in the mongo shell, the correct index is used. Why does the same query not get the same execution plan?

Kindly find below a log entry showing the wrong index being used:

 

2020-05-25T04:55:51.624+0000 I COMMAND [conn301319] command mydb.mycoll command: aggregate { aggregate: "mycoll", pipeline: [ { $match: { networkId.$id: { $in: [ ObjectId('5e0ed9eb60b2533bda7a0fa8') ] }, status: "0", alarmType: "1" } }, { $group: { _id:{ networkId: "$networkId" }, alarmCount: { $sum: 1 } } } ] } planSummary: IXSCAN { status: 1 } keysExamined:35350 docsExamined:35350 numYields:280 nreturned:0 reslen:135 locks:{ Global: { acquireCount:{ r: 574 }, acquireWaitCount: { r: 92 }, timeAcquiringMicros: { r: 1590945 } }, Database: { acquireCount:{ r: 287 }}, Collection: { acquireCount:{ r: 286 }} } protocol:op_query 1980ms
 

Indexes on the collection:
 

db.getCollection('mycoll').getIndexes()
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "name" : "_id_",
        "ns" : "mydb.mycoll"
    },
    {
        "v" : 1,
        "key" : { "networkId.$id" : 1, "status" : 1, "alarmType" : 1 },
        "name" : "networkId.$id_1_status_1_alarmType_1",
        "ns" : "mydb.mycoll"
    },
    {
        "v" : 2,
        "key" : { "status" : 1 },
        "name" : "status_1",
        "ns" : "mydb.mycoll",
        "background" : true
    }
]
 



 Comments   
Comment by Chris Harris [ 16/Feb/23 ]

Hi anansingh@netgear.com - since we haven't heard back from you in a while, we are going to close this ticket. If you are able to collect the requested information to allow us to continue troubleshooting, please feel free to update the ticket with it.

Best,
Chris

Comment by Chris Harris [ 11/Dec/20 ]

Hi anansingh@netgear.com,

Apologies for the delayed response.  My name is Chris and I am happy to assist.

Having reviewed all of the data that you have provided (thank you very much for doing so), we believe that the overall behavior that you are experiencing is described and being tracked by other tickets. The most notable is SERVER-21697, titled "Plan ranking should take query and index keys into consideration for breaking ties".

This conclusion is based on the statistics provided in the output from the plan cache.  In it we can see that the plans both generated the same score despite the fact that the single field index required a filter in the FETCH stage while the plan using the compound index did not.  

This is related to the fact that plan selection is done by sampling the candidate plans against a subset of the data.  In your particular situation, it appears that all of the documents in the tested data that had a status of "0" also happened to match the other two predicates as well (a "networkId.$id" of ObjectId('5e0ee74a60b2533bda9541cd') and an alarmType of "1").  This potentially skewed data helped contribute to the ambiguity in the plan selection process and allowed for the single field index to win and get cached.
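As a rough sketch of why the two candidates tie (this assumes the general shape of the plan ranking formula in this server version; the exact constants are not stated in this ticket), each plan is scored as roughly a base score plus its productivity during the trial period plus small tie-breaker bonuses. Because both candidates advanced a result on every work unit of the 101-result trial visible in the plan cache output below, they end up with the same score:

score ≈ base + (advanced / works) + tie-breaker bonuses
      ≈ 1 + (101 / 101) + 0.0003
      = 2.0003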

One potential solution in your situation may be to rebuild the compound index with the status key first and then simply drop the single field index that currently exists.  The new index would still support any query that only filters on the status field while also servicing the query of concern in this ticket that contains all three fields in the query predicate.  We would also encourage you to vote for and follow SERVER-21697 for future updates regarding that specific behavior. 
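As a concrete sketch of that suggestion (assuming the mydb.mycoll namespace and field names from this ticket; the index name shown for the drop is the existing single-field index), the steps in the mongo shell might look like:

// Build a compound index with status as the leading key.
db.getSiblingDB("mydb").mycoll.createIndex(
    { "status" : 1, "networkId.$id" : 1, "alarmType" : 1 },
    { background : true }
)

// Once the new index has finished building, the single-field index is redundant and can be dropped.
db.getSiblingDB("mydb").mycoll.dropIndex("status_1")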

It is also important to mention that despite the fact that the single-field index wins the query planning process (and is cached) at times, we normally would not anticipate that it would have a significant or lasting impact on your system. The database includes a replanning mechanism which it uses to monitor and update plans in the cache as needed. Based on the design of that feature and the evidence provided so far in this ticket, it does seem peculiar that replanning did not happen while executing that particular operation.
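For reference (this is general plan cache usage in the 3.6 shell, not a step requested in this ticket), the cached plan for a given query shape can be inspected or cleared manually so that the planner re-evaluates the candidates on the next execution:

// List the query shapes currently cached for the collection.
db.getSiblingDB("mydb").mycoll.getPlanCache().listQueryShapes()

// Drop the cached plan for this particular query shape.
db.getSiblingDB("mydb").mycoll.getPlanCache().clearPlansByQuery(
    { "networkId.$id" : { $in : [ ObjectId("5e0ee74a60b2533bda9541cd") ] }, "status" : "0", "alarmType" : "1" }
)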

There are a variety of explanations that could account for this, such as the possibility that the mongod executing the operation (seemingly the PRIMARY as there is no readPreference specified in the log entry) does not have the same set of indexes that the SECONDARY member has (where the explain and plan cache stats came from).  That said, if you are able to reproduce the situation where the aggregation both produces no results but scans thousands of index keys then we would need you to perform the following steps to allow us to investigate further:

  • Execute the following commands to increase the log verbosity while connected to the impacted mongod node via the mongo shell:

db.setProfilingLevel(0,-1)
db.setLogLevel(5,'query')

  • Run the application to trigger the behavior from the database.
  • Reset the logging thresholds back to their defaults, which can be done with these commands:

db.setProfilingLevel(0,100)
db.setLogLevel(-1,'query')

We hope that this information is helpful.  Please let us know how you would like to proceed.

Best,
Chris

Comment by Anand Singh [ 08/Oct/20 ]

Hi all,

Please provide an update on the progress of this issue.

Comment by Anand Singh [ 01/Sep/20 ]

Hi all,

Is there any update or progress on this issue?

Comment by Anand Singh [ 05/Aug/20 ]

Hi james.wahlin,

Kindly find our responses in the same order:

1. We have approximately 4-5 unique values for 'networkId.$id' in the collection.

2. Yes, 'networkId.$id' is a DBRef in the collection.

3. We are using spring-data-mongodb version 2.2.7.

4. No, it does not use a COLLSCAN; in that case the right index is used.

5. If the above information is not sufficient for your debugging, we will need to check with our security/legal team and let you know whether we can share the logs via a private upload mechanism.

Let us know if any other information is required to debug this issue.

Comment by James Wahlin [ 29/Jul/20 ]

Hi anansingh@netgear.com,

Sorry for the delay in getting back to you. At this point it is not clear what the root cause for this issue is. We would like to ask you for additional information to help us narrow our investigation. Can you please:
1) Give us an idea of how selective the "networkId.$id" field is. Out of the 35350 documents scanned, how many could potentially share the same value? Running the following may help give us a feel:

mydb.mycoll.aggregate([ { $group: { _id:{ networkId: "$networkId" }, alarmCount: { $sum: 1 } } }, {$sort: {alarmCount: -1}}, {$limit: 5} ])

2) Can you confirm whether you are using DBRefs? We suspect so, given the "$id" field name. We generally discourage the use of DBRefs if you are in a position to change them, and recommend manual references instead. Manual references provide greater flexibility and, when used with the $lookup aggregation stage, allow for server-side joins (see the sketch after this list).
3) Can you tell us what driver you are using and the version?
4) If you are in a test environment, what behavior do you see if you drop the "status_1" index? Does this result in a collection scan (COLLSCAN) being chosen?
5) If we were to provide a private upload mechanism, would you be willing to share logs and/or collection data with us?
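To illustrate the manual-reference approach from point 2, here is a minimal sketch (the "alarms" and "networks" collection names are hypothetical, not taken from this ticket): instead of a DBRef, the document stores a plain ObjectId in networkId, which $lookup can resolve server-side.

db.alarms.aggregate([
    { $match : { status : "0", alarmType : "1" } },
    { $lookup : {
        from : "networks",          // hypothetical referenced collection
        localField : "networkId",   // manual reference stored as a plain ObjectId
        foreignField : "_id",
        as : "network"
    } }
])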

Thanks,
James

Comment by Anand Singh [ 20/Jul/20 ]

Hi all,

Please keep us posted here so that we are all on the same page regarding the status of the above problem.

Comment by Anand Singh [ 08/Jul/20 ]

Hi all,

Is there any update on the above issue?

Comment by Anand Singh [ 26/Jun/20 ]

Hi carl.champain,

Kindly find the output of explain and the query plan cache below. Due to privacy/security reasons, we are not able to share the logs at this time. Can we get a solution/suggestion based on the output below, or are the logs required?

rs0:SECONDARY> db.getCollection('myColl').find({ 'networkId.$id': { '$in': [ ObjectId('5e0ee74a60b2533bda9541cd') ] }, 'status': "0", 'alarmType': "1" }).explain('executionStats')
{
 "queryPlanner" : {
 "plannerVersion" : 1,
 "namespace" : "mydb.myColl",
 "indexFilterSet" : false,
 "parsedQuery" : {
 "$and" : [
 {
 "alarmType" : {
 "$eq" : "1"
 }
 },
 {
 "networkId.$id" : {
 "$eq" : ObjectId("5e0ee74a60b2533bda9541cd")
 }
 },
 {
 "status" : {
 "$eq" : "0"
 }
 }
 ]
 },
 "winningPlan" : {
 "stage" : "FETCH",
 "filter" : {
 "$and" : [
 {
 "alarmType" : {
 "$eq" : "1"
 }
 },
 {
 "networkId.$id" : {
 "$eq" : ObjectId("5e0ee74a60b2533bda9541cd")
 }
 }
 ]
 },
 "inputStage" : {
 "stage" : "IXSCAN",
 "keyPattern" : {
 "status" : 1
 },
 "indexName" : "status_1",
 "isMultiKey" : false,
 "multiKeyPaths" : {
 "status" : [ ]
 },
 "isUnique" : false,
 "isSparse" : false,
 "isPartial" : false,
 "indexVersion" : 2,
 "direction" : "forward",
 "indexBounds" : {
 "status" : [
 "[\"0\", \"0\"]"
 ]
 }
 }
 },
 "rejectedPlans" : [
 {
 "stage" : "FETCH",
 "inputStage" : {
 "stage" : "IXSCAN",
 "keyPattern" : {
 "networkId.$id" : 1,
 "status" : 1,
 "alarmType" : 1
 },
 "indexName" : "networkId.$id_1_status_1_alarmType_1",
 "isMultiKey" : false,
 "multiKeyPaths" : {
 "networkId.$id" : [ ],
 "status" : [ ],
 "alarmType" : [ ]
 },
 "isUnique" : false,
 "isSparse" : false,
 "isPartial" : false,
 "indexVersion" : 1,
 "direction" : "forward",
 "indexBounds" : {
 "networkId.$id" : [
 "[ObjectId('5e0ee74a60b2533bda9541cd'), ObjectId('5e0ee74a60b2533bda9541cd')]"
 ],
 "status" : [
 "[\"0\", \"0\"]"
 ],
 "alarmType" : [
 "[\"1\", \"1\"]"
 ]
 }
 }
 }
 
 ]
 },
 "executionStats" : {
 "executionSuccess" : true,
 "nReturned" : 22030,
 "executionTimeMillis" : 98,
 "totalKeysExamined" : 22030,
 "totalDocsExamined" : 22030,
 "executionStages" : {
 "stage" : "FETCH",
 "filter" : {
 "$and" : [
 {
 "alarmType" : {
 "$eq" : "1"
 }
 },
 {
 "networkId.$id" : {
 "$eq" : ObjectId("5e0ee74a60b2533bda9541cd")
 }
 }
 ]
 },
 "nReturned" : 22030,
 "executionTimeMillisEstimate" : 20,
 "works" : 22031,
 "advanced" : 22030,
 "needTime" : 0,
 "needYield" : 0,
 "saveState" : 175,
 "restoreState" : 175,
 "isEOF" : 1,
 "invalidates" : 0,
 "docsExamined" : 22030,
 "alreadyHasObj" : 0,
 "inputStage" : {
 "stage" : "IXSCAN",
 "nReturned" : 22030,
 "executionTimeMillisEstimate" : 10,
 "works" : 22031,
 "advanced" : 22030,
 "needTime" : 0,
 "needYield" : 0,
 "saveState" : 175,
 "restoreState" : 175,
 "isEOF" : 1,
 "invalidates" : 0,
 "keyPattern" : {
 "status" : 1
 },
 "indexName" : "status_1",
 "isMultiKey" : false,
 "multiKeyPaths" : {
 "status" : [ ]
 },
 "isUnique" : false,
 "isSparse" : false,
 "isPartial" : false,
 "indexVersion" : 2,
 "direction" : "forward",
 "indexBounds" : {
 "status" : [
 "[\"0\", \"0\"]"
 ]
 },
 "keysExamined" : 22030,
 "seeks" : 1,
 "dupsTested" : 0,
 "dupsDropped" : 0,
 "seenInvalidated" : 0
 }
 }
 },
 "serverInfo" : {
 "host" : "Node3",
 "port" : 27017,
 "version" : "3.6.18",
 "gitVersion" : "2005f25eed7ed88fa698d9b800fe536bb0410ba4"
 },
 "ok" : 1
}

db.getCollection('myColl').getPlanCache().getPlansByQuery({ "networkId.$id": { $in: [ ObjectId('5e0ee74a60b2533bda9541cd') ] }, "status": "0", "alarmType": "1" })
[
 {
 "details" : {
 "solution" : "(index-tagged expression tree: tree=Node\n---Leaf \n---Leaf \n---Leaf status_1, pos: 0, can combine? 1\n)"
 },
 "reason" : {
 "score" : 2.0003,
 "stats" : {
 "stage" : "FETCH",
 "filter" : {
 "$and" : [
 {
 "alarmType" : {
 "$eq" : "1"
 }
 },
 {
 "networkId.$id" : {
 "$eq" : ObjectId("5e0ee74a60b2533bda9541cd")
 }
 }
 ]
 },
 "nReturned" : 101,
 "executionTimeMillisEstimate" : 0,
 "works" : 101,
 "advanced" : 101,
 "needTime" : 0,
 "needYield" : 0,
 "saveState" : 3,
 "restoreState" : 3,
 "isEOF" : 0,
 "invalidates" : 0,
 "docsExamined" : 101,
 "alreadyHasObj" : 0,
 "inputStage" : {
 "stage" : "IXSCAN",
 "nReturned" : 101,
 "executionTimeMillisEstimate" : 0,
 "works" : 101,
 "advanced" : 101,
 "needTime" : 0,
 "needYield" : 0,
 "saveState" : 3,
 "restoreState" : 3,
 "isEOF" : 0,
 "invalidates" : 0,
 "keyPattern" : {
 "status" : 1
 },
 "indexName" : "status_1",
 "isMultiKey" : false,
 "multiKeyPaths" : {
 "status" : [ ]
 },
 "isUnique" : false,
 "isSparse" : false,
 "isPartial" : false,
 "indexVersion" : 2,
 "direction" : "forward",
 "indexBounds" : {
 "status" : [
 "[\"0\", \"0\"]"
 ]
 },
 "keysExamined" : 101,
 "seeks" : 1,
 "dupsTested" : 0,
 "dupsDropped" : 0,
 "seenInvalidated" : 0
 }
 }
 },
 "feedback" : {
 "nfeedback" : 20,
 "scores" : [
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 },
 {
 "score" : 2.0003
 }
 ]
 },
 "filterSet" : false
 },
 
 {
 "details" : {
 "solution" : "(index-tagged expression tree: tree=Node\n---Leaf networkId.$id_1_status_1_alarmType_1, pos: 2, can combine? 1\n---Leaf networkId.$id_1_status_1_alarmType_1, pos: 0, can combine? 1\n---Leaf networkId.$id_1_status_1_alarmType_1, pos: 1, can combine? 1\n)"
 },
 "reason" : {
 "score" : 2.0003,
 "stats" : {
 "stage" : "FETCH",
 "nReturned" : 101,
 "executionTimeMillisEstimate" : 0,
 "works" : 101,
 "advanced" : 101,
 "needTime" : 0,
 "needYield" : 0,
 "saveState" : 3,
 "restoreState" : 3,
 "isEOF" : 0,
 "invalidates" : 0,
 "docsExamined" : 101,
 "alreadyHasObj" : 0,
 "inputStage" : {
 "stage" : "IXSCAN",
 "nReturned" : 101,
 "executionTimeMillisEstimate" : 0,
 "works" : 101,
 "advanced" : 101,
 "needTime" : 0,
 "needYield" : 0,
 "saveState" : 3,
 "restoreState" : 3,
 "isEOF" : 0,
 "invalidates" : 0,
 "keyPattern" : {
 "networkId.$id" : 1,
 "status" : 1,
 "alarmType" : 1
 },
 "indexName" : "networkId.$id_1_status_1_alarmType_1",
 "isMultiKey" : false,
 "multiKeyPaths" : {
 "networkId.$id" : [ ],
 "status" : [ ],
 "alarmType" : [ ]
 },
 "isUnique" : false,
 "isSparse" : false,
 "isPartial" : false,
 "indexVersion" : 1,
 "direction" : "forward",
 "indexBounds" : {
 "networkId.$id" : [
 "[ObjectId('5e0ee74a60b2533bda9541cd'), ObjectId('5e0ee74a60b2533bda9541cd')]"
 ],
 "status" : [
 "[\"0\", \"0\"]"
 ],
 "alarmType" : [
 "[\"1\", \"1\"]"
 ]
 },
 "keysExamined" : 101,
 "seeks" : 1,
 "dupsTested" : 0,
 "dupsDropped" : 0,
 "seenInvalidated" : 0
 }
 }
 },
 "feedback" : {
 
 },
 "filterSet" : false
 }
 
]

 

Comment by Carl Champain (Inactive) [ 25/Jun/20 ]

Hi anansingh@netgear.com,

We would still like to see the requested information above.

Thank you,
Carl

Comment by Anand Singh [ 25/Jun/20 ]

Hi carl.champain,

Before sharing the log file, I want to let you know one thing: the wrong index is used in all three places.

In explain, in the query plan cache, and in the mongod.log file.

Kindly let us know if you still want all of the output and the log file.

Comment by Carl Champain (Inactive) [ 24/Jun/20 ]

Thanks for getting back to us, anansingh@netgear.com.

To help us understand what's happening, can you please provide:

Kind regards,
Carl

Comment by Anand Singh [ 24/Jun/20 ]

Hi carl.champain,

I have had a look at SERVER-32452, but in my case keysExamined, docsExamined, and the number of records are all equal. For example, keysExamined:35350, docsExamined:35350, and db.mycoll.count() is also 35350.

Comment by Carl Champain (Inactive) [ 22/Jun/20 ]

Hi anansingh@netgear.com,

Thank you for the report.

It seems that you are experiencing SERVER-32452 and we would recommend reviewing it for workarounds. The ticket describes a fix that may help here, but it isn't the complete story. We're continuing to actively improve our query planner; see SERVER-48228 as a recent example.

Let us know if SERVER-32452 is not the issue you're experiencing.

Kind regards,
Carl

 

Comment by Anand Singh [ 19/Jun/20 ]

We have upgraded MongoDB to 3.6 and we are still seeing this issue.
