[SERVER-83091] $or query can trigger an infinite loop during plan enumeration Created: 09/Nov/23  Updated: 20/Dec/23  Resolved: 28/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 7.0.2, 7.0.3, 7.2.0-rc1
Fix Version/s: 7.3.0-rc0, 7.2.0-rc2, 7.0.5, 6.0.13, 5.0.24

Type: Bug Priority: Critical - P2
Reporter: David Storch Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Fix
Related
related to SERVER-83602 $or -> $in MatchExpression rewrite sh... Closed
is related to SERVER-50291 Add query knob to enumerate $or child... Closed
is related to SERVER-81630 Enable Boolean expression simplifier Closed
is related to SERVER-74893 Change default enumeration strategy f... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.2, v7.0, v6.0, v5.0
Sprint: QO 2023-11-13, QO 2023-11-27, QO 2023-12-11
Participants:
Case:
Linked BF Score: 14

 Description   

Certain $or queries can trigger an infinite loop in the PlanEnumerator. While the plan space can be quite large for indexed $or queries, we have logic which is supposed to bail out of plan enumeration once a certain number of alternatives have been generated. This logic is not kicking in correctly.

Note that this only affects the so-called "lockstep $or" enumeration order which was first implemented in SERVER-50291. Lockstep $or enumeration is intended to make it more likely that the system generates and scores interesting indexed $or plans, in the end resulting in better access paths for indexed $or predicates. It was enabled by default in SERVER-74893 for versions 7.1.0-rc0 and 7.0.2. Many customers are first getting exposed to this new default $or plan enumeration logic as they upgrade to 7.0.2 which made us aware of the bug. The issue can be worked around safely by disabling lockstep $or enumeration. This can be done on the command line, in a config file, or with a shell command such as the following:

db.adminCommand({setParameter: 1, internalQueryEnumerationPreferLockstepOrEnumeration: false});

Note that the infinite loop does not include any interrupt points, so the affected queries are impossible to kill and can block the server from shutting down. It may also manifest as increased CPU activity, since the thread is spinning busily rather than deadlocked.

As of this writing, I've confirmed that 7.0.2 is affected, but I plan to double-check whether more recent versions are affected by the bug as well.



 Comments   
Comment by Githook User [ 01/Dec/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-83091 Fix infinite loop in lockstep $or plan enumeration

(cherry picked from commit 0b8d487a19477c72a64676db203b84763a363bbe)

GitOrigin-RevId: 6dceb9b856961509390e3ff06f035eb65d81d74f
Branch: v5.0
https://github.com/mongodb/mongo/commit/23ed09bed9fbf327117205d7c8d54f963d51847a

Comment by Githook User [ 01/Dec/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-83091 Fix infinite loop in lockstep $or plan enumeration

(cherry picked from commit 479bb27e76bff47398a7c1d3060e818645cef8ed)
Branch: v6.0
https://github.com/mongodb/mongo/commit/0b8d487a19477c72a64676db203b84763a363bbe

Comment by Githook User [ 28/Nov/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-83091 Fix infinite loop in lockstep $or plan enumeration

(cherry picked from commit f84e9c8325f3575929accc645e857c8d00c93391)
Branch: v7.0
https://github.com/mongodb/mongo/commit/479bb27e76bff47398a7c1d3060e818645cef8ed

Comment by Githook User [ 28/Nov/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-83091 Fix infinite loop in lockstep $or plan enumeration

(cherry picked from commit a36f67db89cda0f0b9fbe54a97290a21616efd5d)
Branch: v7.2
https://github.com/mongodb/mongo/commit/f84e9c8325f3575929accc645e857c8d00c93391

Comment by Githook User [ 28/Nov/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-83091 Fix infinite loop in lockstep $or plan enumeration
Branch: master
https://github.com/mongodb/mongo/commit/a36f67db89cda0f0b9fbe54a97290a21616efd5d

Comment by David Storch [ 13/Nov/23 ]

I've solved the mystery of why this initially appeared as though it did not affect the master branch. The reason is that the issue is related to nested $or predicates, but these get simplified by the new boolean simplification module which was recently enabled under SERVER-81630. If I disable this boolean simplification step with the following command, then I can easily reproduce the issue:

db.adminCommand({setParameter: 1, internalQueryEnableBooleanExpressionsSimplifier: false});

Boolean simplification makes this issue much harder to reproduce, though it still seems theoretically possible to construct an example in which nested $or predicates do not get flattened (e.g. if boolean simplification doesn't kick in) that is subject to the bug. Moreover, the system shouldn't depend on boolean simplification to behave correctly. I'm marking the most recent release (7.2.0-rc0) as an affected version. The fix should be implemented in the master branch and backported from there in the usual way.

Comment by David Storch [ 10/Nov/23 ]

After digging through my repro in a debugger, I think I have a sense of what's going wrong. The situation is that the PlanEnumerator memo data structure has two LockstepOrAssignment nodes, where one refers to the other as a subnode. The symptom is that during a call PlanEnumerator::nextMemo() to advance the memo state, we get stuck in the this while() loop associated with the uppermost node.

I think the fundamental problem is that the lockstep enumeration algorithm assumes that each of its subnodes has a fixed number of enumeration states, and will consistently report when it has "wrapped around". For instance, imagine that a subnode has three enumeration states. The lockstep algorithm will only work correctly if every third recursive call to nextMemo() on this subnode returns true in order to indicate that it has wrapped around. When the subnode is itself a LockstepOrAssignment node, this assumption is violated. In particular, it is violated because the LockstepOrAssignment subnode will start returning true on every nextMemo() call as soon as it has reached internalQueryEnumerationMaxOrSolutions limit (which has a default value of 10).

I think that we should probably fix this by relaxing the lockstep enumeration algorithm to work correctly even if a subnode starts wrapping around sooner than it did on the first round of plan enumeration. That way we don't have to fundamentally change the property that each OR node in the memo only allows a maximum of 10 distinct plan enumerations.

The next steps for this ticket are to:

  • Confirm that the bug affects the master branch, and assuming that it does, alter the repro so that it works against the master branch.
  • Look into the details of a fix.
Comment by David Storch [ 10/Nov/23 ]

My script reproduces the issue on versions between 7.0.2 and 7.0.4-rc0 (the most recent 7.0.x release as of this writing). I could not reproduce it in the same way on 7.0.0, 7.0.1, or 7.2.0-rc1. This may be happenstance, or it may be that the bug really only affects certain versions on the 7.0 branch. I'll update the "affectsVersions" field accordingly for now.

EDIT: I realized that the bug does not affect 7.0.0 or 7.0.1 because lockstep enumeration was not enabled as the default behavior until 7.0.2 (see SERVER-74893). If I enable lockstep enumeration explicitly, then I can reproduce the infinite loop against those versions. However, my observation stands that 7.2.0-rc1 does not appear to be affected, even when lockstep enumeration is enabled.

Generated at Thu Feb 08 06:51:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.