[SERVER-43202] Aggregation system can continue trying to execute a query plan after being interrupted, leading to server-fatal invariant failure Created: 06/Sep/19 Updated: 29/Oct/23 Resolved: 02/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework, Querying |
| Affects Version/s: | 4.2.0 |
| Fix Version/s: | 4.2.1, 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Storch | Assignee: | Ian Boros |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | CTSA, KP42 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.2 |
| Steps To Reproduce: | I was able to reproduce this failure by first instrumenting the code with the following patch. The patch adds a bit of logging, but most importantly causes every query yield point to result in a ClientDisconnect error:
Next, I ran the following script against a build of the instrumented server. This configures the server to yield frequently, and then runs an aggregate command which exercises the flawed code path.
|
| Sprint: | Query 2019-09-09, Query 2019-09-23, Query 2019-10-07 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 11 |
| Description |
|
The aggregation subsystem may invoke the underlying multi-planning system multiple times in order to select its execution plan. For example, when there is a $sort, it first attempts to construct a non-blocking $sort plan. If no such plan exists, it invokes the plan selection code again without pushing down the $sort to the underlying PlanStage layer. See the code implementing this logic here.

If the operation is killed during this process (because the client disconnected, because of an explicit killOp command, or because of a conflicting catalog event such as a collection drop), the process should terminate at the next yield point and the operation should stop running. However, incorrect logic accidentally swallows the interruption error and attempts to keep executing the aggregation operation. Specifically, the checks that propagate the error code QueryPlanKilled miss other possible error codes, e.g. ClientDisconnect.

Since the interrupt check happens at query yield points, the interruption error is propagated after locks have been relinquished. The consequence is that we attempt to keep executing the aggregation operation without holding the proper locks. This promptly triggers the following server-fatal invariant check, which asserts that the necessary locks are held:

When this scenario occurs, the server will shut down and the logs will contain the following message, along with a backtrace:
|
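As an illustration only, the control flow described in this ticket can be modeled in a few lines of Python. This is a simplified sketch and not the server's code: `Operation`, `select_plan`, `prepare_buggy`, `prepare_fixed`, `NoSuchPlan` and `InvariantFailure` are all hypothetical names, and the exception classes merely stand in for the error codes named in the description. The sketch assumes that a yield point releases locks before checking for interruption, and that plan selection requires locks on entry.

```python
class QueryPlanKilled(Exception):
    """Stand-in for the QueryPlanKilled error code."""


class ClientDisconnect(Exception):
    """Stand-in for the ClientDisconnect error code."""


class NoSuchPlan(Exception):
    """Hypothetical: no non-blocking $sort plan exists."""


class InvariantFailure(Exception):
    """Hypothetical stand-in for the server-fatal invariant."""


class Operation:
    """Hypothetical model of an operation's lock state and pending interrupt."""

    def __init__(self, pending_error=None):
        self.locks_held = True
        self.pending_error = pending_error

    def yield_point(self):
        # Locks are relinquished first, then the interrupt check runs, so an
        # interruption error leaves the operation with no locks held.
        self.locks_held = False
        if self.pending_error is not None:
            raise self.pending_error
        self.locks_held = True  # locks are reacquired after a clean yield


def select_plan(op, push_down_sort):
    """One round of plan selection; it must start with locks held."""
    if not op.locks_held:
        raise InvariantFailure("plan selection attempted without locks")
    op.yield_point()
    if push_down_sort:
        # This model assumes a non-blocking $sort plan is never available.
        raise NoSuchPlan("no non-blocking $sort plan")
    return "blocking-sort plan"


def prepare_buggy(op):
    """Propagates only QueryPlanKilled; every other error looks like 'no plan'."""
    try:
        return select_plan(op, push_down_sort=True)
    except QueryPlanKilled:
        raise
    except Exception:
        pass  # ClientDisconnect is swallowed here along with NoSuchPlan
    return select_plan(op, push_down_sort=False)


def prepare_fixed(op):
    """Retries only when no plan was found; any interruption propagates."""
    try:
        return select_plan(op, push_down_sort=True)
    except NoSuchPlan:
        return select_plan(op, push_down_sort=False)
```

In this model, `prepare_buggy(Operation(ClientDisconnect()))` swallows the interruption and then fails the lock check on the second planning attempt, while `prepare_fixed` lets the `ClientDisconnect` reach the caller. Retrying only on the specific "no plan" outcome is one way to avoid the problem; it is not necessarily how the actual fix was written.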
| Comments |
| Comment by Githook User [ 04/Oct/19 ] |
|
Author: Ian Boros (username: puppyofkosh, email: ian.boros@mongodb.com) Message: |
| Comment by Githook User [ 01/Oct/19 ] |
|
Author: Ian Boros (username: puppyofkosh, email: ian.boros@mongodb.com) Message: |