[SERVER-57513] SBE multi-planning can incorrectly favor a blocking SORT plan over a non-blocking plan Created: 07/Jun/21  Updated: 29/Oct/23  Resolved: 19/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Martin Neupauer Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: sbe-post-v1, sbe-rollout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-57771 SBE explain does not report correct t... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query Execution 2021-06-28, Query Execution 2021-07-12, Query Execution 2021-07-26
Participants:

 Description   

Rerunning the famous Eliot's chess query game-search-test.js (github.com)

shows different winning plans are selected.

 

We have to understand what is going on



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 19/Jul/21 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-57513 Prevent early exit in sort stage if there are non-blocking candidate plans
Branch: master
https://github.com/mongodb/mongo/commit/8fd4cd2c997692e3d40afad84064203d71a3127f

Comment by David Storch [ 16/Jun/21 ]

I've figured it out. The query is this:

printjson(db.game_search
              .find({
                  avgRating: {$gt: 1000},
                  turns: {$lt: 250},
                  'clock.init': {$gt: 1},
                  minutes: {$gt: 2, $lt: 150},
              })
              .sort({date: -1})
              .limit(20)
              .explain('allPlansExecution'));

We have single-field indexes on each of the fields mentioned in the match expression as well as on "date". The classic engine selects the following winning plan:

			"stage" : "LIMIT",
			"limitAmount" : 20,
			"inputStage" : {
				"stage" : "FETCH",
				"filter" : {
					"$and" : [
						{
							"minutes" : {
								"$lt" : 150
							}
						},
						{
							"turns" : {
								"$lt" : 250
							}
						},
						{
							"avgRating" : {
								"$gt" : 1000
							}
						},
						{
							"clock.init" : {
								"$gt" : 1
							}
						},
						{
							"minutes" : {
								"$gt" : 2
							}
						}
					]
				},
				"inputStage" : {
					"stage" : "IXSCAN",
					"keyPattern" : {
						"date" : 1
					},
					"indexName" : "date_1",
					"isMultiKey" : false,
					"multiKeyPaths" : {
						"date" : [ ]
					},
					"isUnique" : false,
					"isSparse" : false,
					"isPartial" : false,
					"indexVersion" : 2,
					"direction" : "backward",
					"indexBounds" : {
						"date" : [
							"[MaxKey, MinKey]"
						]
					}
				}
			}

This plan scans the entire {date: 1} index, thereby avoiding a blocking SORT, and applies the filter to each of the corresponding documents. It ran in 58 milliseconds on my machine. SBE, on the other hand, is picking the following plan:

                                "stage" : "SORT",
                                "planNodeId" : 3,
                                "sortPattern" : {
                                        "date" : -1
                                },
                                "memLimit" : 104857600,
                                "limitAmount" : 20,
                                "type" : "simple",
                                "inputStage" : {
                                        "stage" : "FETCH",
                                        "planNodeId" : 2,
                                        "filter" : {
                                                "$and" : [
                                                        {
                                                                "avgRating" : {
                                                                        "$gt" : 1000
                                                                }
                                                        },
                                                        {
                                                                "clock.init" : {
                                                                        "$gt" : 1
                                                                }
                                                        },
                                                        {
                                                                "turns" : {
                                                                        "$lt" : 250
                                                                }
                                                        }
                                                ]
                                        },
                                        "inputStage" : {
                                                "stage" : "IXSCAN",
                                                "planNodeId" : 1,
                                                "keyPattern" : {
                                                        "minutes" : 1
                                                },
                                                "indexName" : "minutes_1",
                                                "isMultiKey" : false,
                                                "multiKeyPaths" : {
                                                        "minutes" : [ ]
                                                },
                                                "isUnique" : false,
                                                "isSparse" : false,
                                                "isPartial" : false,
                                                "indexVersion" : 2,
                                                "direction" : "forward",
                                                "indexBounds" : {
                                                        "minutes" : [
                                                                "(2.0, 150.0)"
                                                        ]
                                                }
                                        }
                                }
                        },

This is a bounded scan on the {minutes: 1} index followed by a blocking SORT. It ran in 120 milliseconds on my machine, which seems to confirm that the classic engine chose the right plan and the SBE engine chose the wrong one.

The allPlansExecution output was unhelpful for diagnosing this bug due to SERVER-57771, so I relied upon the scoring information that is written to the logs when the "query" log component is 2 or higher:

{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn8","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { avgRating: 1 }","planHitEOF":false}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn8","msg":"Score formula","attr":{"formula":"score(1.0003780094523214) = baseScore(1) + productivity((1 advances)/(3596 numReads) = 0.00027800945232137893) + tieBreakers(0 noFetchBonus + 0 noSortBonus + 0.0001 noIxisectBonus = 0.0001)"}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn8","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { turns: 1 }","planHitEOF":false}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn8","msg":"Score formula","attr":{"formula":"score(1.0003171552660153) = baseScore(1) + productivity((1 advances)/(4604 numReads) = 0.00021715526601520088) + tieBreakers(0 noFetchBonus + 0 noSortBonus + 0.0001 noIxisectBonus = 0.0001)"}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn8","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { minutes: 1 }","planHitEOF":false}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn8","msg":"Score formula","attr":{"formula":"score(1.0154846153846153) = baseScore(1) + productivity((1 advances)/(64 numReads) = 0.015384615384615385) + tieBreakers(0 noFetchBonus + 0 noSortBonus + 0.0001 noIxisectBonus = 0.0001)"}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn8","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { clock.init: 1 }","planHitEOF":false}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn8","msg":"Score formula","attr":{"formula":"score(1.0002991635132443) = baseScore(1) + productivity((1 advances)/(5020 numReads) = 0.00019916351324437363) + tieBreakers(0 noFetchBonus + 0 noSortBonus + 0.0001 noIxisectBonus = 0.0001)"}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20957,   "ctx":"conn8","msg":"Scoring query plan","attr":{"planSummary":"IXSCAN { date: 1 }","planHitEOF":false}}
{"t":{"$date":"2021-06-16T16:41:41.429-04:00"},"s":"D2", "c":"QUERY",    "id":20961,   "ctx":"conn8","msg":"Score formula","attr":{"formula":"score(1.003642058678905) = baseScore(1) + productivity((21 advances)/(6100 numReads) = 0.0034420586789050975) + tieBreakers(0 noFetchBonus + 0.0001 noSortBonus + 0.0001 noIxisectBonus = 0.00020000000000000001)"}}
{"t":{"$date":"2021-06-16T16:41:41.430-04:00"},"s":"D2", "c":"QUERY",    "id":4822876, "ctx":"conn8","msg":"Winning plan","attr":{"planSummary":"IXSCAN { minutes: 1 }"}}

This shows scoring information for the five candidate plans. The first four involve a blocking sort, whereas the last is the non-blocking plan which scans the {date: 1} index. I confirmed using gdb that each of the first four blocking sort plans early-exited with a QueryTrialRunCompleted exception. This means that they each returned zero results during the trial period; the one advance listed in the productivity formula is due to this logic. Namely, we always add one to the actual number of advances. The fifth plan returned all 20 results during the trial period, resulting in 21 being used as the numerator for calculating the productivity ratio.

So why did the {minutes: 1} plan end up beating the {date: 1} plan. Let's compare their productivity ratios:

  • minutes: (1 advances)/(64 numReads) = 0.0154
  • date: (21 advances)/(6100 numReads) = 0.0034

The {minutes: 1} plan appears to be more productive because its denominator is so small. But why is the denominator so small? This means that the trial period ended after performing only about 60 reads from storage, whereas the trial period ended for all the other candidate plans after performing thousands of reads. The reason has to do with the early-exit logic implemented in the SBE sort stage. The current logic is to abort the trial period after a certain number of results have been ingested by the sort stage. In this case, that number of results is 20 because of the limit(20) in the query. This logic seems to be flawed, because we are scoring the plan not based on the ratio of the number of results ingested by the sort to numReads, but based on the ratio of the number of results ultimately returned at the root of the tree to numReads. Although it is true that the blocking sort plan which returns its first batch of results to the sort stage fastest is the best amongst the set of candidate plans which have a blocking sort, it doesn't seem valid to compare the resulting score of a blocking sort plan to a non-blocking one. This example shows that the blocking sort plan that exits earliest can have an artificially inflated score relative to candidate plans which do not have a blocking sort stage.

I see two possible ways to fix SBE's plan selection logic for sorts:

  • Simply eliminate the sort stage's early exit logic, and rely on the ixscan/scan stages' early exit logic to terminate the trial period after a maximum number of reads from a storage cursor have been performed. That way, each candidate with a blocking SORT is given the same budget of reads, and won't exit super early.
  • Don't always calculate "advanced" in the productivity formula (advanced + 1) / (numReads + 1) based on the root of the SBE tree. Instead, identify the "data access portion" of the SBE tree. In other words, if the SBE plan involves a sort stage, use the number of advances of the sort stage's child rather than the number of advances at the root. This means that the productivity formula would always be measuring the performance of the data access portion of the plan, and would not take into account downstream operations like sorting or grouping at all. The upside of this approach is that it is more consistent with what we will likely do as we try to push additional operations into the SBE layer, such as $group or $lookup. The downside is that it is significantly different from what the classic engine does. Also, we would not be incorporating the cost of sorting into the scoring formula, which might cause our plan ranking logic to favor blocking sort plans when it should not.
Comment by Kyle Suarez [ 15/Jun/21 ]

martin.neupauer, could you share what plan we expect and what plan we actually get with SBE enabled?

Generated at Thu Feb 08 05:42:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.