[SERVER-38852] Failing $merge can leave zombie writes in the cluster Created: 04/Jan/19 Updated: 08/Sep/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework, Write Ops |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | query-44-grooming |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Operating System: | ALL |
| Steps To Reproduce: | I first reproduced the error as described in the description using the attached repro_agg_duplicate_key.js. It seems to reproduce the failure more frequently than running with just --suites=sharding, though I think possibly I am imagining this. |
| Participants: | |
| Description |
|
If a $merge fails on one shard, the process of cleaning up the other shards' writes can leave some writes cancelled, but still very much alive in the network. Upon any error, all other shards participating in an aggregation will be interrupted. Because an interrupted call to AsyncRequestsSender::next() will cancel its callbacks and return without waiting for the responses to inserts or updates we've already sent, we can end up returning control to the user while there are still active writes which were generated on their behalf. This is causing problems in some of our tests, but is generally surprising and probably undesirable.

In more detail, consider a sharded cluster which has collections test.input and test.output, both sharded by {_id: 1}. Suppose that test.input is distributed like so:

shard 1: [MinKey, 5) and [15, MaxKey); shard 2: [5, 15)
and suppose test.output is distributed like so:

shard 1: [MinKey, 10); shard 2: [10, MaxKey)
If we had 20 documents in test.input with _ids in [0, 20) and we executed an aggregation on test.input ending with a $merge into test.output,
then each shard would end up sending 5 documents to itself and 5 to the other shard: [0, 5) would go from shard 1 to itself, [5, 10) would go from shard 2 to shard 1, [10, 15) would go from shard 2 to itself, and [15, 20) would go from shard 1 to shard 2.

Then suppose that when performing the inserts from shard 1, some of them return with an error, maybe a duplicate key. The aggregate on shard 1 will return an error to mongos. When that error gets back to mongos, mongos will send a killCursors to the aggregate (with the $merge) running on shard 2. Upon receiving that killCursors, the $merge stage may be in a state where it has scheduled both writes but hasn't heard back from one or both of them. In this scenario, the AsyncRequestsSender will cancel its callbacks as described above, the aggregate and its cursor will be destroyed, and we will respond to mongos and then to the user with the error without verifying that the writes have finished (successfully or not).

In some of our tests, we perform a $merge which is expected to fail after partially completing, then issue a subsequent db.target.remove({}); and expect that remove to get rid of everything in the collection. Because there can be lingering writes from the $merge, the remove may succeed but very shortly afterwards some documents will be inserted into the collection, causing a subsequent assertion to fail. I've also attached repro_no_agg.js |
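The document routing in this example can be sketched with a small simulation. This is illustrative Python, not server code; the chunk boundaries (with MinKey/MaxKey modeled as None) are assumptions consistent with the batches described above, not taken from an actual cluster.

```python
# Hypothetical chunk layouts inferred from the description above.
INPUT_CHUNKS = {"shard1": [(None, 5), (15, None)], "shard2": [(5, 15)]}   # test.input
OUTPUT_CHUNKS = {"shard1": [(None, 10)], "shard2": [(10, None)]}          # test.output

def owner(chunks, key):
    """Return the shard owning `key` under a {_id: 1} range-sharded layout."""
    for shard, ranges in chunks.items():
        for lo, hi in ranges:
            if (lo is None or key >= lo) and (hi is None or key < hi):
                return shard
    raise KeyError(key)

def merge_batches(ids):
    """Group writes by (source shard, target shard): each shard's $merge stage
    sends its local input documents to the shard owning the output chunk."""
    batches = {}
    for _id in ids:
        src = owner(INPUT_CHUNKS, _id)
        dst = owner(OUTPUT_CHUNKS, _id)
        batches.setdefault((src, dst), []).append(_id)
    return batches

batches = merge_batches(range(20))
# Four batches of five documents each, matching the description:
# shard1->shard1: [0,5), shard2->shard1: [5,10),
# shard2->shard2: [10,15), shard1->shard2: [15,20)
```

The cross-shard batches ([5, 10) and [15, 20)) are the ones that can be left in flight when the aggregate is interrupted.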
| Comments |
| Comment by Charlie Swanson [ 06/Sep/19 ] |
|
david.storch I filed SERVER-43198 to track the work that is actually feasible. I'm not sure what we should do with this ticket, but it does look like all the linked BFs are instances of a remove({}) followed by lingering inserts/upserts, rather than a drop and implicit re-create, which SERVER-43198 would help with. We haven't seen any of these failures recently though, so it's worth discussing whether we should mark them trivial, investigate why they stopped happening, or something else. |
| Comment by Charlie Swanson [ 05/Sep/19 ] |
|
Charlie to do some JIRA paperwork tomorrow to make it clear that we're interested in fixing zombie writes re-creating collections, but are not pursuing a fix for the more generic problem of leftover work from a cancelled operation. |
| Comment by Charlie Swanson [ 23/Aug/19 ] |
|
Didn't get as much time to work on this today since I was mostly preoccupied by BF-14452. I did try to run the reproduction script but it no longer works. I'd like to keep looking into that next time. |
| Comment by Charlie Swanson [ 22/Aug/19 ] |
|
I did get to work on this last Friday. I have a patch that I think will address the problem and hopefully is something we can do for the 4.2 branch as well. It's a little unpolished at this point and I still haven't reproduced this problem so I have no evidence that it actually works, but I've attached what I did so far in zombie_killer.patch After just attaching that I see there are reproduction scripts attached to this ticket (which I apparently made??) so I'd like to start there and continue tomorrow. |
| Comment by Charlie Swanson [ 08/Aug/19 ] |
|
After discussing in-person, we believe there's probably a bug where $out and $merge are setting allowImplicitCreate when they should not be. I'm assigning to myself to investigate. |
| Comment by Mira Carey [ 26/Jul/19 ] |
That's going to be a non-starter. It's enormously problematic to make uninterruptible opCtxs, especially ones waiting for unbounded amounts of time. Even if we want to wait somewhere else farther up the stack (and return different errors to users), you have to unwind your stack (to allow for the release of locks that might be needed to make progress on stepDown). |
| Comment by Jacob Evans [ 26/Jul/19 ] |
|
After discussing this we are now considering adding an option to aggregate (and potentially all operations in the long-term) to cause the client operation itself to wait for all children to die before returning control. This would be very useful to our tests since we'd obviate the need to check currentOp to ensure the operation is really over. We also think it might be useful since it provides these optional guarantees to other clients if they're willing to wait. Let us know what you think. With that out of the way, let me answer open questions: |
| Comment by Mira Carey [ 23/Jul/19 ] |
|
A couple of questions:
|
| Comment by Kaloian Manassiev [ 23/Jul/19 ] |
|
The way I understand is that this is a manifestation of the generic problem about how to cancel a "tree" of distributed operations in the cluster. Your proposal to leave the $merge parent active until all its children have completed seems like the architecturally correct way to achieve this - each parent is responsible for killing and waiting for clean-up of its children. A couple of considerations:
|
| Comment by Jacob Evans [ 12/Jul/19 ] |
|
The solution we've settled on after discussion is to have a killed $merge persist in a dead state until all children are complete. Although this doesn't allow the command to be stopped swiftly, it leaves the parent $merge around as an indicator that the operation has not yet gone away, so at least we would be more accurately reporting the current state of the cluster. This is also something our tests can rely on. We've decided to solve this in the general case for other bulk operations if possible and if it requires little extra work. In order to facilitate this we foresee the need to change the ClusterWriter and possibly the AsyncRequestsSender so that parent operations can track the lifetime of their children. |
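The difference between cancelling callbacks and waiting for children can be sketched with Python's concurrent.futures. This is an illustrative analogy only, not MongoDB code: Future.cancel() on an already-running task is a no-op, much like cancelling the callback for a write that has already been sent to a shard.

```python
# Illustrative analogy, not server code: a "remote write" that is already
# running cannot be cancelled; the caller either returns while the work is
# still alive (a zombie write) or waits for it to finish.
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

started = threading.Event()   # the write request has reached the "shard"
landed = threading.Event()    # the write has actually been applied

def remote_insert():
    started.set()
    time.sleep(0.05)          # the write is in flight on the "shard"
    landed.set()              # ...and lands whether or not anyone is listening

executor = ThreadPoolExecutor(max_workers=1)
fut = executor.submit(remote_insert)
started.wait()

# What an interrupted sender does today: cancelling a task that is already
# running is a no-op, so the write keeps going after control returns.
assert fut.cancel() is False
assert not landed.is_set()    # we'd be responding before the write landed

# Proposed behavior: the killed parent lingers until its children complete.
wait([fut])
assert landed.is_set()
executor.shutdown()
```

The final wait() is the analogue of the dead-but-persistent $merge: the operation is no longer doing useful work, but it does not release control until its children are accounted for.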
| Comment by Charlie Swanson [ 04/Jan/19 ] |
|
I'm leaving this in Needs Triage because I'm not sure whether we should work around this in our tests or add some customization to the AsyncRequestsSender to avoid leaving things around when interrupted for agg's purposes. As of this writing, I think we should probably fix the server. |