[SERVER-41872] PlanEnumerator AndAssignment::choices ordering not stable and is relevant to set of plans generated Created: 21/Jun/19  Updated: 29/Oct/23  Resolved: 18/Oct/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.0-rc1
Fix Version/s: 4.3.1, 4.2.4, 4.0.21

Type: Bug Priority: Major - P3
Reporter: James Wahlin Assignee: James Wahlin
Resolution: Fixed Votes: 0
Labels: greenerbuild, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
Related
is related to SERVER-38249 stdx::unordered_map should be impleme... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0, v3.6
Sprint: Query 2019-07-15, Query 2019-07-29, Query 2019-08-12, Query 2019-08-26
Participants:
Linked BF Score: 0

 Description   

The following diff includes a jstest that will shift between 2 sets of candidate plans depending on the run. We would expect the set of plans generated to remain constant across test executions.

diff --git a/jstests/core/repro.js b/jstests/core/repro.js
new file mode 100644
index 0000000000..511d1819ef
--- /dev/null
+++ b/jstests/core/repro.js
@@ -0,0 +1,27 @@
+
+db.coll.drop();
+db.coll.insert({});
+db.coll.createIndexes([
+    {"obj.obj.obj.obj.str": 1, "obj.obj.obj.str": 1},
+    {"obj.obj.obj.str": -1, "date": -1},
+    {"obj.date": 1},
+    {"obj.obj.obj.str": 1, "obj.obj.obj.obj.date": -1},
+    {"obj.date": -1, "obj.obj.obj.str": 1},
+    {"obj.date": 1, "obj.obj.num": 1},
+    {"obj.obj.obj.str": 1, "obj.obj.obj.date": -1},
+    {"obj.obj.obj.obj.str": 1, "obj.date": 1}
+]);
+var result =
+    db.coll
+        .aggregate([
+            {
+              $match: {
+                  $and: [
+                      {$or: [{'obj.obj.obj.str': {$not: {$in: []}}}, {'obj.date': {$nin: []}}]},
+                      {'obj.obj.obj.obj.str': {$not: {$lte: ''}}}
+                  ]
+              }
+            },
+            {$project: {"obj.obj.obj.obj.str": 1, "_id": 0}}
+        ])
+        .toArray();
diff --git a/src/mongo/db/query/query_planner.cpp b/src/mongo/db/query/query_planner.cpp
index 9735dbade0..4145df3056 100644
--- a/src/mongo/db/query/query_planner.cpp
+++ b/src/mongo/db/query/query_planner.cpp
@@ -828,7 +828,7 @@ StatusWith<std::vector<std::unique_ptr<QuerySolution>>> QueryPlanner::plan(
     // Don't leave tags on query tree.
     query.root()->resetTag();
 
-    LOG(5) << "Planner: outputted " << out.size() << " indexed solutions.";
+    LOG(0) << "Planner: outputted " << out.size() << " indexed solutions.";
 
     // Produce legible error message for failed OR planning with a TEXT child.
     // TODO: support collection scan for non-TEXT children of OR.



 Comments   
Comment by Githook User [ 03/Nov/20 ]

Author:

{'name': 'Charlie Swanson', 'email': 'charlie.swanson@mongodb.com', 'username': 'cswanson310'}

Message: Revert "SERVER-41872 PlanEnumerator AndAssignment::choices ordering not stable and is relevant to set of plans generated"

This reverts commit b010331e00cb46efe4e98e0f74150854b4407d67.
Branch: v3.6
https://github.com/mongodb/mongo/commit/d7231fc0c2a18d101d315ff68349b95dfb264150

Comment by Githook User [ 30/Sep/20 ]

Author:

{'name': 'James Wahlin', 'email': 'james.wahlin@mongodb.com', 'username': 'jameswahlin'}

Message: SERVER-41872 PlanEnumerator AndAssignment::choices ordering not stable and is relevant to set of plans generated

(cherry picked from commit 6ca9be706ce03a2c79d29feac3cfd294cc09445d)
(cherry picked from commit 4af99c4f2845a228fab9c4aaaafe513e21c792a5)
(cherry picked from commit 5959f04d2c6c661a5446be04264e0f5fa79747d8)
Branch: v3.6
https://github.com/mongodb/mongo/commit/b010331e00cb46efe4e98e0f74150854b4407d67

Comment by Githook User [ 30/Sep/20 ]

Author:

{'name': 'James Wahlin', 'email': 'james.wahlin@mongodb.com', 'username': 'jameswahlin'}

Message: SERVER-41872 PlanEnumerator AndAssignment::choices ordering not stable and is relevant to set of plans generated

(cherry picked from commit 6ca9be706ce03a2c79d29feac3cfd294cc09445d)
(cherry picked from commit 4af99c4f2845a228fab9c4aaaafe513e21c792a5)
Branch: v4.0
https://github.com/mongodb/mongo/commit/5959f04d2c6c661a5446be04264e0f5fa79747d8

Comment by Githook User [ 07/Feb/20 ]

Author:

{'name': 'James Wahlin', 'username': 'jameswahlin', 'email': 'james.wahlin@mongodb.com'}

Message: SERVER-41872 PlanEnumerator AndAssignment::choices ordering not stable and is relevant to set of plans generated

(cherry picked from commit 6ca9be706ce03a2c79d29feac3cfd294cc09445d)
Branch: v4.2
https://github.com/mongodb/mongo/commit/4af99c4f2845a228fab9c4aaaafe513e21c792a5

Comment by Githook User [ 18/Oct/19 ]

Author:

{'username': 'jameswahlin', 'email': 'james.wahlin@mongodb.com', 'name': 'James Wahlin'}

Message: SERVER-41872 PlanEnumerator AndAssignment::choices ordering not stable and is relevant to set of plans generated
Branch: master
https://github.com/mongodb/mongo/commit/6ca9be706ce03a2c79d29feac3cfd294cc09445d

Comment by James Wahlin [ 27/Sep/19 ]

On evaluating this again today, I found that we can replace the stdx::unordered_map with std::map for IndexToPredMap. Performance is in-line with existing implementation and this provides stable ordering. My earlier findings detailed above were incorrect due to a log level 0 debug message I forgot to remove when testing. Will post a patch for this next week.

Comment by James Wahlin [ 13/Sep/19 ]

I spent time today reviewing potential fixes. The three options I identified are:
1) Sorting the memo's choices list post-construction
2) Sorting the PlanEnumerator's 'idxToFirst' map into an ordered reference list prior to iteration
3) Changing the data structure backing 'idxToFirst' from an unordered to an ordered map.

The simplest solution would be #3. Use of a std::map would allow for us to maintain a stable iteration state. I ran a perf patch which changed both idxToFirst and idxToNotFirst to use std::set. This proved to have a meaningful impact on performance so won't be viable. I plan to follow this with a patch that only makes the change to idxToFirst, but suspect there may still be an impact as we perform lookups in this structure as well.

For #1, 'choices' is a list of AndEnumerableState, which then contains 2 vectors containing 'OneIndexAssignment' and 'MemoID'. It is not immediately clear to me how we would want to sort this list though we could spend more time thinking about it.

Comment by Charlie Swanson [ 12/Sep/19 ]

james.wahlin can you give a shot at fixing this as part of BF Friday? Thanks!

Comment by James Wahlin [ 04/Sep/19 ]

david.storch to clarify, in the 0 and 1 solution case, the PlanEnumerator generates 12 plans which are then passed to QueryPlannerAnalysis::analyzeDataAccess() to determine whether they are valid solutions. QueryPlannerAnalysis::analyzeDataAccess() is the step that is removing either 11 or 12 plans from contention. The reason that plans are removed is that they are not covered plans and QueryPlanner::plan() is executed with the NO_UNCOVERED_PROJECTIONS option. When 0 solutions are produced, QueryPlanner::plan() is executed a second time with the projection omitted. This allows for all 12 plans to be considered.

Comment by David Storch [ 30/Aug/19 ]

james.wahlin, thanks for the detailed write-up! There's a part of this that I'm missing, though. Based on the memo structures from the logs, it looks like there are a large number of valid plans that could be output by the plan enumerator. Therefore, I'm confused as to how we end up with either 0 or 1 total solutions produced (depending on the ordering of the memo). Are solutions being generated and discarded? I'm not aware of any case where we would do this other than discarding index intersection plans, but in both memos it looks like we would generate plans without ixisect first and then move onto ixisect plans later.

Comment by James Wahlin [ 28/Aug/19 ]

This is caused by a combination of the following:

  • stdx:unordered_set (backed by absl::node_hash_map) does not provide a stable iteration order
  • Ordering differences can produce PlanEnumerator memos that are identical in content but not in choice ordering within nodes. The choice ordering dictates the order that the PlanEnumerator will generate solutions
  • In order to limit the number of plans generated, the PlanEnumerator has short-circuit behavior which will prevent generating further plans once a limit is reached

For our reproduction case, the initial call to QueryPlanner::plan() generates either zero or one solution. The following is an example PlanEnumerator memo generated for each:

Zero solution memo

[Node #1]:
   choice 0:
           subnodes:
           idx[5]
                   pos 0 pred obj.date $in [ ]
 
   choice 1:
           subnodes:
           idx[4]
                   pos 0 pred obj.date $in [ ]
 
   choice 2:
           subnodes:
           idx[7]
                   pos 1 pred obj.date $in [ ]
           orPushdownPred: obj.obj.obj.obj.str $lte ""
 
   choice 3:
           subnodes:
           idx[2]
                   pos 0 pred obj.date $in [ ]
 
[Node #2]:
   choice 0:
           subnodes:
           idx[1]
                   pos 0 pred obj.obj.obj.str $in [ ]
 
   choice 1:
           subnodes:
           idx[3]
                   pos 0 pred obj.obj.obj.str $in [ ]
 
   choice 2:
           subnodes:
           idx[0]
                   pos 1 pred obj.obj.obj.str $in [ ]
           orPushdownPred: obj.obj.obj.obj.str $lte ""
 
   choice 3:
           subnodes:
           idx[6]
                   pos 0 pred obj.obj.obj.str $in [ ]
 
[Node #3]: ALL OF: [ 1 2 ]
[Node #4]:
   choice 0:
           subnodes: 3
 
   choice 1:
           subnodes:
           idx[7]
                   pos 0 pred obj.obj.obj.obj.str $lte ""
 
   choice 2:
           subnodes:
           idx[0]
                   pos 0 pred obj.obj.obj.obj.str $lte ""
 
   choice 3:
           subnodes: 3
           idx[7]
                   pos 0 pred obj.obj.obj.obj.str $lte ""
 
   choice 4:
           subnodes: 3
           idx[0]
                   pos 0 pred obj.obj.obj.obj.str $lte ""

One solution memo

[Node #1]:
   choice 0:
           subnodes:
           idx[4]
                   pos 0 pred obj.date $in [ ]
 
   choice 1:
           subnodes:
           idx[7]
                   pos 1 pred obj.date $in [ ]
           orPushdownPred: obj.obj.obj.obj.str $lte ""
 
   choice 2:
           subnodes:
           idx[2]
                   pos 0 pred obj.date $in [ ]
 
   choice 3:
           subnodes:
           idx[5]
                   pos 0 pred obj.date $in [ ]
 
[Node #2]:
   choice 0:
           subnodes:
           idx[3]
                   pos 0 pred obj.obj.obj.str $in [ ]
 
   choice 1:
           subnodes:
           idx[0]
                   pos 1 pred obj.obj.obj.str $in [ ]
           orPushdownPred: obj.obj.obj.obj.str $lte ""
 
   choice 2:
           subnodes:
           idx[1]
                   pos 0 pred obj.obj.obj.str $in [ ]
 
   choice 3:
           subnodes:
           idx[6]
                   pos 0 pred obj.obj.obj.str $in [ ]
 
[Node #3]: ALL OF: [ 1 2 ]
[Node #4]:
   choice 0:
           subnodes: 3
 
   choice 1:
           subnodes:
           idx[0]
                   pos 0 pred obj.obj.obj.obj.str $lte ""
 
   choice 2:
           subnodes:
           idx[7]
                   pos 0 pred obj.obj.obj.obj.str $lte ""
 
   choice 3:
           subnodes: 3
           idx[0]
                   pos 0 pred obj.obj.obj.obj.str $lte ""
 
   choice 4:
           subnodes: 3
           idx[7]
                   pos 0 pred obj.obj.obj.obj.str $lte ""

The solution that is produced in one case and not in the other is represented by the following Node/Choice combinations in the above memos:

Solutions Produced Node 1 Node 2 Node 4
0 Choice 2 Choice 2 Choice 0
1 Choice 1 Choice 1 Choice 0

In the case where we produce zero solutions, the PlanEnumerator reaches a point where the OrAssignment counter is equal to _orLimit (aka internalQueryEnumerationMaxOrSolutions) prior to reaching the valid solution. In the case where we produce one solution, the valid solution is generated at an earlier point, prior to reaching this limit.

Given the above, we will need a way to ensure that we enumerate over all plans in the same order, regardless of the iteration order provided by the PlanEnumerator's unordered_set structures. Additionally we should evaluate whether the limits we have in place are reasonable for a case like this. It is unfortunate in this case that our limits lead us to produce no solutions when one exists.

Comment by James Wahlin [ 27/Aug/19 ]

The cause for this change in behavior is SERVER-38249, "stdx::unordered_map should be implemented as absl::node_hash_map".

It is likely that the absl::node_hash_map ordering is less stable then our previous unordered_map implementation for this use case, exposing us to different table ordering on subsequent runs. Next step is to understand where in our PlanEnumerator code it is that IndexToPredMap order can impact the results of PlanEnumeration.

Comment by James Wahlin [ 27/Aug/19 ]

Currently in the PlanEnumerator we use an unordered_map to store AND node index intersection assignments. We did work under SERVER-12196 to make sure that enumeration results are independent of the ordering encountered when iterating over this map. It looks a change has been introduced for 4.2 which makes enumeration results now order dependent and causes the plan generation flip-flop seen here. Changing the IndexToPredMap typedef from stdx::unordered_map to std::map stabilizes plan generation.

Next step on this ticket is to understand how this was broken for the provided reproducer. It is worth noting that we would like stable plan selection regardless of both unordered_map usage and of index creation order. This means that just changing IndexToPredMap from unordered_map to map is not likely to be sufficient and could also hurt performance.

Comment by James Wahlin [ 03/Jul/19 ]

I spent more time looking at this today. I found that the inputs to QueryPlanner::plan() are identical for both plan set variants.

Where things start to differ is in plan enumeration. Dumping the memo for each to a log string shows that while the content appears to be the same, the ordering within the "choices" list is different.

The MatchExpression list structure then changes after a call to resolveOrPushdowns() within prepareForAccessPlanning(). We then perform access planning, and find a single covered solution (which is accepted) in one case and no covered solutions in the other.

My guess at this point is that the "choice" ordering within the memo affects the point in which we perform an OR pushdown and that mutation of the tree for OR pushdown impacts subsequent access planning. This is just a theory - I don't have evidence yet to back this.

Here is an update to the patch above which includes logging for the memo and MatchExpression tree post OR pushdown. To compare between the 2 sets of plans, I did the following:

  1. Build mongod with the following patch
  2. Run the included repro.js script. You will see that it either:
    1. Performs a single QueryPlanner::plan() invocation which produces a single covered plan
    2. Or performs two QueryPlanner::plan() invocations, the first matching 0 plans and the second matching 12
  3. Diff the debug logging for the "1 plan" and "0 plan" output.

diff --git a/jstests/core/repro.js b/jstests/core/repro.js
new file mode 100644
index 0000000000..511d1819ef
--- /dev/null
+++ b/jstests/core/repro.js
@@ -0,0 +1,27 @@
+
+db.coll.drop();
+db.coll.insert({});
+db.coll.createIndexes([
+    {"obj.obj.obj.obj.str": 1, "obj.obj.obj.str": 1},
+    {"obj.obj.obj.str": -1, "date": -1},
+    {"obj.date": 1},
+    {"obj.obj.obj.str": 1, "obj.obj.obj.obj.date": -1},
+    {"obj.date": -1, "obj.obj.obj.str": 1},
+    {"obj.date": 1, "obj.obj.num": 1},
+    {"obj.obj.obj.str": 1, "obj.obj.obj.date": -1},
+    {"obj.obj.obj.obj.str": 1, "obj.date": 1}
+]);
+var result =
+    db.coll
+        .aggregate([
+            {
+              $match: {
+                  $and: [
+                      {$or: [{'obj.obj.obj.str': {$not: {$in: []}}}, {'obj.date': {$nin: []}}]},
+                      {'obj.obj.obj.obj.str': {$not: {$lte: ''}}}
+                  ]
+              }
+            },
+            {$project: {"obj.obj.obj.obj.str": 1, "_id": 0}}
+        ])
+        .toArray();
diff --git a/src/mongo/db/query/index_tag.cpp b/src/mongo/db/query/index_tag.cpp
index 61e0204d3d..7d0db755d9 100644
--- a/src/mongo/db/query/index_tag.cpp
+++ b/src/mongo/db/query/index_tag.cpp
@@ -293,7 +293,9 @@ void resolveOrPushdowns(MatchExpression* tree) {
 const size_t IndexTag::kNoIndex = std::numeric_limits<size_t>::max();
 
 void prepareForAccessPlanning(MatchExpression* tree) {
+    // tree->toString() is identical prior to resolveOrPushdowns call.
     resolveOrPushdowns(tree);
+    std::cout << "After resolveOrPushdowns:  " << tree->toString() << std::endl;
     sortUsingTags(tree);
 }
 
diff --git a/src/mongo/db/query/plan_enumerator.h b/src/mongo/db/query/plan_enumerator.h
index 33c7923119..09d5a37dc0 100644
--- a/src/mongo/db/query/plan_enumerator.h
+++ b/src/mongo/db/query/plan_enumerator.h
@@ -107,6 +107,7 @@ public:
      * added in order to indicate index usage.
      */
     std::unique_ptr<MatchExpression> getNext();
+    std::string dumpMemo();
 
 private:
     //
@@ -520,8 +521,6 @@ private:
      */
     MemoID memoIDForNode(MatchExpression* node);
 
-    std::string dumpMemo();
-
     // Map from expression to its MemoID.
     stdx::unordered_map<MatchExpression*, MemoID> _nodeToId;
 
diff --git a/src/mongo/db/query/query_planner.cpp b/src/mongo/db/query/query_planner.cpp
index 9735dbade0..bf893c32ae 100644
--- a/src/mongo/db/query/query_planner.cpp
+++ b/src/mongo/db/query/query_planner.cpp
@@ -781,6 +781,7 @@ StatusWith<std::vector<std::unique_ptr<QuerySolution>>> QueryPlanner::plan(
 
         PlanEnumerator isp(enumParams);
         isp.init().transitional_ignore();
+        std::cout << isp.dumpMemo() << std::endl;
 
         unique_ptr<MatchExpression> nextTaggedTree;
         while ((nextTaggedTree = isp.getNext()) && (out.size() < params.maxIndexedSolutions)) {
@@ -812,9 +813,10 @@ StatusWith<std::vector<std::unique_ptr<QuerySolution>>> QueryPlanner::plan(
                 continue;
             }
 
+            std::cout << "solnRoot->fetched(): " << solnRoot->fetched() << std::endl;
             auto soln = QueryPlannerAnalysis::analyzeDataAccess(query, params, std::move(solnRoot));
             if (soln) {
-                LOG(5) << "Planner: adding solution:" << endl << redact(soln->toString());
+                std::cout << "Adding Solution" << std::endl;
                 if (statusWithCacheData.isOK()) {
                     SolutionCacheData* scd = new SolutionCacheData();
                     scd->tree = std::move(cacheData);
@@ -828,7 +830,7 @@ StatusWith<std::vector<std::unique_ptr<QuerySolution>>> QueryPlanner::plan(
     // Don't leave tags on query tree.
     query.root()->resetTag();
 
-    LOG(5) << "Planner: outputted " << out.size() << " indexed solutions.";
+    std::cout << "Planner: outputted " << out.size() << " indexed solutions." << std::endl;
 
     // Produce legible error message for failed OR planning with a TEXT child.
     // TODO: support collection scan for non-TEXT children of OR.

Generated at Thu Feb 08 04:58:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.