[SERVER-32539] Suboptimal cached query plan may not be evicted for multi-key sort Created: 04/Jan/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Jonathan Roy Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Server 3.4.10 on Ubuntu 16.04.3 LTS


Issue Links:
Related
is related to SERVER-32452 Replanning may not occur when a plan ... Closed
Assigned Teams:
Query Execution
Operating System: ALL
Participants:

 Description   
Original Summary

Sorting used multi-key in one direction only, explain() provides wrong info

Original Description

I was looking at the slow query log tonight and decided to try to optimize some of them. One simple query was:

db.games.find({ ip: "1.2.3.4" }, { _id: 0, sgfhash: 1} ).sort({ _id: -1 }).limit(1);

to get the most recent itme with that IP address. Turns out this was taking 4-10 seconds. So I added a multi-index key for ip,_id so it could use that to optimize the search/sort. After much hair pulling, I finally realized the sort _id: 1 direction was instant, but the _id: -1 was the slow one. I tried changing my index to be ip:1,_id:-1 and ip:-1,_id:-1 with no help. Sort '1' was always fast, '-1' always slow.

I wrote about this all here as well when at my wit's end:

https://www.reddit.com/r/mongodb/comments/7nyx9b/using_indexes_to_reverse_sort_a_query/

What I found is that the optimizer was only using the proper key when sorting with '1' and never when sorting with '-1'. explain() mislead me however. Explain would say this:

"winningPlan" : {
                    "stage" : "LIMIT",
                    "limitAmount" : 1,
                    "inputStage" : {
                            "stage" : "PROJECTION",
                            "transformBy" : {
                                    "_id" : 0,
                                    "sgfhash" : 1
                            },
                            "inputStage" : {
                                    "stage" : "FETCH",
                                    "inputStage" : {
                                            "stage" : "IXSCAN",
                                            "keyPattern" : {
                                                    "ip" : -1,
                                                    "_id" : -1
                                            },
                                            "indexName" : "ip_-1__id_-1",
                                            "isMultiKey" : false,
                                            "multiKeyPaths" : {
                                                    "ip" : [ ],
                                                    "_id" : [ ]
                                            },
                                            "isUnique" : false,
                                            "isSparse" : false,
                                            "isPartial" : false,
                                            "indexVersion" : 2,
                                            "direction" : "forward",
                                            "indexBounds" : {
                                                    "ip" : [
                                                            "[\"1.2.3.4\", \"1.2.3.4\"]"
                                                    ],
                                                    "_id" : [
                                                            "[MaxKey, MinKey]"
                                                    ]
                                            }
                                    }
                            }
                    }
            }

Only changing "direction" to "backwards" if I tried the query in the opposite sort direction.

However, I turned on profiling and ran the slow sort direction, and I get this:

{
    "op" : "query",
    "ns" : "test.games",
    "query" : {
            "find" : "games",
            "filter" : {
                    "ip" : "1.2.3.4"
            },
            "limit" : 1,
            "singleBatch" : false,
            "sort" : {
                    "_id" : -1
            },
            "projection" : {
                    "_id" : 0,
                    "sgfhash" : 1
            }
    },
    "keysExamined" : 534,
    "docsExamined" : 534,
    "hasSortStage" : true,
    "cursorExhausted" : true,
    "numYield" : 215,
    "locks" : {
            "Global" : {
                    "acquireCount" : {
                            "r" : NumberLong(432)
                    }
            },
            "Database" : {
                    "acquireCount" : {
                            "r" : NumberLong(216)
                    }
            },
            "Collection" : {
                    "acquireCount" : {
                            "r" : NumberLong(216)
                    }
            }
    },
    "nreturned" : 1,
    "responseLength" : 169,
    "protocol" : "op_command",
    "millis" : 4772,
    "planSummary" : "IXSCAN { ip: 1 }",
    "execStats" : {
            "stage" : "CACHED_PLAN",
            "nReturned" : 1,
            "executionTimeMillisEstimate" : 4777,
            "works" : 2,
            "advanced" : 1,
            "needTime" : 0,
            "needYield" : 0,
            "saveState" : 215,
            "restoreState" : 215,
            "isEOF" : 1,
            "invalidates" : 0,
            "inputStage" : {
                    "stage" : "PROJECTION",
                    "nReturned" : 1,
                    "executionTimeMillisEstimate" : 4777,
                    "works" : 537,
                    "advanced" : 1,
                    "needTime" : 536,
                    "needYield" : 0,
                    "saveState" : 215,
                    "restoreState" : 215,
                    "isEOF" : 1,
                    "invalidates" : 0,
                    "transformBy" : {
                            "_id" : 0,
                            "sgfhash" : 1
                    },
                    "inputStage" : {
                            "stage" : "SORT",
                            "nReturned" : 1,
                            "executionTimeMillisEstimate" : 4777,
                            "works" : 537,
                            "advanced" : 1,
                            "needTime" : 536,
                            "needYield" : 0,
                            "saveState" : 215,
                            "restoreState" : 215,
                            "isEOF" : 1,
                            "invalidates" : 0,
                            "sortPattern" : {
                                    "_id" : -1
                            },
                            "memUsage" : 769426,
                            "memLimit" : 33554432,
                            "limitAmount" : 1,
                            "inputStage" : {
                                    "stage" : "SORT_KEY_GENERATOR",
                                    "nReturned" : 534,
                                    "executionTimeMillisEstimate" : 4667,
                                    "works" : 536,
                                    "advanced" : 534,
                                    "needTime" : 1,
                                    "needYield" : 0,
                                    "saveState" : 215,
                                    "restoreState" : 215,
                                    "isEOF" : 1,
                                    "invalidates" : 0,
                                    "inputStage" : {
                                            "stage" : "FETCH",
                                            "nReturned" : 534,
                                            "executionTimeMillisEstimate" : 4667,
                                            "works" : 535,
                                            "advanced" : 534,
                                            "needTime" : 0,
                                            "needYield" : 0,
                                            "saveState" : 215,
                                            "restoreState" : 215,
                                            "isEOF" : 1,
                                            "invalidates" : 0,
                                            "docsExamined" : 534,
                                            "alreadyHasObj" : 0,
                                            "inputStage" : {
                                                    "stage" : "IXSCAN",
                                                    "nReturned" : 534,
                                                    "executionTimeMillisEstimate" : 13,
                                                    "works" : 535,
                                                    "advanced" : 534,
                                                    "needTime" : 0,
                                                    "needYield" : 0,
                                                    "saveState" : 215,
                                                    "restoreState" : 215,
                                                    "isEOF" : 1,
                                                    "invalidates" : 0,
                                                    "keyPattern" : {
                                                            "ip" : 1
                                                    },
                                                    "indexName" : "ip_1",
                                                    "isMultiKey" : false,
                                                    "multiKeyPaths" : {
                                                            "ip" : [ ]
                                                    },
                                                    "isUnique" : false,
                                                    "isSparse" : false,
                                                    "isPartial" : false,
                                                    "indexVersion" : 2,
                                                    "direction" : "forward",
                                                    "indexBounds" : {
                                                            "ip" : [
                                                                    "[\"1.2.3.4\", \"1.2.3.4\"]"
                                                            ]
                                                    },
                                                    "keysExamined" : 534,
                                                    "seeks" : 1,
                                                    "dupsTested" : 0,
                                                    "dupsDropped" : 0,
                                                    "seenInvalidated" : 0
                                            }
                                    }
                            }
                    }
            }
    },
    "ts" : ISODate("2018-01-04T01:09:03.361Z"),
    "client" : "127.0.0.1",
    "appName" : "MongoDB Shell",
    "allUsers" : [ ],
    "user" : ""

Which revealed it wasn't using the multi-key index at all. I vaguely remembered the hint command, and making my query more like this:

db.games.find({ ip:  "1.2.3.4" }, { _id: 0, sgfhash: 1} ).hint( "ip_-1__id_-1" ).sort({ _id: -1 }).limit(1);

solved it. Results are now instant.

I realize the query optimizer can't make every query perfect and that is why hint() exists, but it seemed odd it'd work for one direction but not the other, and that explain() would lie to me about what was going to happen vs what the profiling actually showed.



 Comments   
Comment by David Storch [ 16/Aug/19 ]

roy7, I'm returning to this report as part of a sweep of the MongoDB Server Query Team's ticket backlog. As Kelsey mentioned above, the log lines you have provided to us suggest that the bad plan is being generated from the plan cache. It doesn't look like we have enough information in this ticket to determine how that bad plan was cached in the first place, or why it was not evicted.

Do you have any further details about how we might be able to reproduce this issue, possibly with an artificial data set?

Finally, the 4.2 GA release includes SERVER-32452, which is designed to prevent bad plans from remaining in the plan cache. I have no evidence to suggest that these changes would resolve this particular issue. Generally speaking, however, we expect 4.2 to have fewer issues around bad plans remaining in the plan cache for too long.

Comment by Jonathan Roy [ 04/Jan/18 ]

We'd prefer not to share the database itself since it's full of contributor's IP addresses, but if you need any more detailed logs/etc I'm happy to help if you explain the exact things you want me to do.

Comment by Jonathan Roy [ 04/Jan/18 ]

Here's some example log items from /var/log/mongodb/mongod.log before using the hint. I'll sanitize the IP addresses out. I'm checking with the developer if we can securely give you the data file in question so you can reproduce this locally.

When you say mongod logs at loglevel 1, did you mean the below or something else? Thanks.

2018-01-04T03:51:00.295+0100 I COMMAND  [conn4192] command test.games command: find { find: "games", filter: { ip: "1.2.3.4" }, sort: { _id: -1 }, projection: { _id: 0, sgfhash: 1 }, limit: 1 } planSummary: IXSCAN { ip: 1 } keysExamined:534 docsExamined:534 hasSortStage:1 cursorExhausted:1 numYields:364 nreturned:1 reslen:184 locks:{ Global: { acquireCount: { r: 730 } }, Database: { acquireCount: { r: 365 } }, Collection: { acquireCount: { r: 365 } } } protocol:op_query 9142ms

2018-01-04T03:51:15.876+0100 I COMMAND  [conn4194] command test.games command: find { find: "games", filter: { ip: "2.3.4.5" }, sort: { _id: -1 }, projection: { _id: 0, sgfhash: 1 }, limit: 1 } planSummary: IXSCAN { ip: 1 } keysExamined:45 docsExamined:45 hasSortStage:1 cursorExhausted:1 numYields:13 nreturned:1 reslen:184 locks:{ Global: { acquireCount: { r: 28 } }, Database: { acquireCount: { r: 14 } }, Collection: { acquireCount: { r: 14 } } } protocol:op_query 268ms

2018-01-04T03:51:21.552+0100 I COMMAND  [conn4196] command test.games command: find { find: "games", filter: { ip: "3.4.5.6" }, sort: { _id: -1 }, projection: { _id: 0, sgfhash: 1 }, limit: 1 } planSummary: IXSCAN { ip: 1 } keysExamined:43 docsExamined:43 hasSortStage:1 cursorExhausted:1 numYields:37 nreturned:1 reslen:184 locks:{ Global: { acquireCount: { r: 76 } }, Database: { acquireCount: { r: 38 } }, Collection: { acquireCount: { r: 38 } } } protocol:op_query 1189ms

Comment by Kelsey Schubert [ 04/Jan/18 ]

Hi roy7,

Thanks for reporting this issue. It appears that the behavior you are observing is the result of suboptimal plan being cached. The output from Explain bypasses the cache, which is why you're observing different query plans from the profiler and explain. We'll need to dig deeper to understand why the query plan is not being reevaluated/evicted in favor of the query solution you hint at. It would greatly speed our investigation if you were able to to provide a dataset and query that reproduces this issue.

Is it possible to to provide the data files from the affected mongod as well as the complete mongod logs at loglevel 1? I would be happy to create a secure upload portal to ensure that your files would only be visible to MongoDB employees investigating this issue.

Thank you,
Kelsey

Generated at Thu Feb 08 04:30:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.