[SERVER-31078] Query planning is very slow during multiplanning when result set is empty Created: 13/Sep/17 Updated: 24/Jul/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mark Callaghan | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Assigned Teams: |
Query Optimization
|
||||||||||||||||||||
| Sprint: | Query 2017-10-23 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||
| Description |
|
The MultiPlanStage picks the best plan by "working all the plans, stopping when a plan hits EOF or returns some fixed number of results". But if the query returns no results, this stopping condition can take a very long time and we've essentially done a full index scan before returning. Original DescriptionI updated my insert benchmark scripts to do full index scans for the PK and secondary indexes. The queries are written to do full index scans and find no rows. There are two loops for the timing – first run the query, second explain the query. Two of the queries have slow explains – queries 2 and 4. From the explain output they each have rejected plans that use a sort. For the output below, there are 4 queries. Query 1 is the PK scan and then queries 2, 3 and 4 are full scans of each secondary index. The data below is from a setup with 10M docs in one collection. This is the time to scan each index:
And this is the time to do the explain for each query. Note that the explain for queries 2 and 4 takes more than 10 seconds. For the queries with slow explains there were rejected plans that use a sort.
A sample command line for repeating the test using https://github.com/mdcallag/mytools/blob/master/bench/ibench/iql.sh Planner output for the slow explain with query 2. There is an index on (price,customerid) and the query is:
|
| Comments |
| Comment by David Storch [ 08/Dec/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks kyle.suarez. We've seen a couple cases like this before, e.g. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kyle Suarez [ 06/Nov/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi mdcallag, Sorry for the delay in getting back to you. I can reproduce this behavior and it confirms our suspicions about the trial period for plan selection. I wrote my own script that roughly simulates your benchmark and populates a collection with 10000000 documents and builds two compound indexes:
To get more accurate timing information on the plan selection trial period, I made the following modifications to the multiplanner:
A query with one expected result in the center of the data set is very fast and working all the plans takes about 326 microseconds. A query that returns zero results takes forever: 7057401 microseconds (7.056 seconds) to select the best plan, which is basically all of the round-trip time as measured by the JS shell script. Thanks for the report; I've updated the description and am sending this to the Query Team to triage at the next planning meeting. Regards, P.S. My full standalone repro script is here: https://gist.github.com/ksuarz/ace0a62ddb3da958efa44c3e5e779b15 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kyle Suarez [ 14/Sep/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi mdcallag, thanks for filing this; I'll be taking a look at it in the coming weeks. david.storch believes that the query returns no results is significant. For such queries, the plan ranker is unable to determine which of its plans is the best because none of them return any results. This might lead to an extended trial period that aversely affects query timing. However, I'll have to take a deeper dive at the code to be sure. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Callaghan [ 14/Sep/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is an example from Mongo 3.4.6 for mmapv1, WiredTiger and MongoRocks. For each engine there are 2 groups of 4 lines. The first (explain 0) has the number of seconds to run the query for each index (Query 1 is PK, Query 2, 3 & 4 are the secondary indexes). The second group (explain 1) has the number of seconds for the explain. From below the explain for query 2 and query 4 take half as long as the query, and in the MongoRocks case that is about 4 minutes. From mmapv1
From WiredTiger
From MongoRocks
|