[SERVER-38852] Failing $merge can leave zombie writes in the cluster Created: 04/Jan/19  Updated: 08/Sep/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework, Write Ops
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro_agg_duplicate_key.js     File repro_no_agg.js     File zombie_killer.patch    
Issue Links:
Related
is related to SERVER-43198 Zombie writes from failing $merge sho... Backlog
is related to SERVER-80853 $out on secondary node can produce in... In Code Review
is related to SERVER-43851 Work around zombie writes in $merge t... Closed
Assigned Teams:
Query Execution
Operating System: ALL
Steps To Reproduce:

I first reproduced the error described below in the attached repro_agg_duplicate_key.js. I don't know why, but running this reproduction with

python buildscripts/resmoke.py --suites=sharding_continuous_config_stepdown repro_agg_duplicate_key.js

reproduces the failure more frequently than running with just --suites=sharding, though it's possible I'm imagining that.

Participants:

 Description   

If a $merge fails on one shard, the process of cleaning up the other shards' work can leave writes whose callbacks have been cancelled but which are still very much alive in the cluster. Upon any error, all other shards participating in an aggregation will be interrupted. Because an interrupted call to AsyncRequestsSender::next() will cancel the callbacks and return without waiting for the responses to inserts or updates we've already sent, we can end up returning control to the user while there are still active writes that were generated on their behalf. This is causing problems in some of our tests, but it is generally surprising and probably undesirable.

In more detail, consider a sharded cluster which has collections test.input and test.output, both sharded by {_id: 1}. Suppose that test.input is distributed like so:

  • chunks [0, 5) and [15, 20) on shard 1
  • chunk [5, 15) on shard 2

and suppose test.output is distributed like so:

  • chunk [0, 10) on shard 1
  • chunk [10, 20) on shard 2

If we had 20 documents in test.input with _ids in [0, 20) and we executed

db.input.aggregate([{$merge: {to: "output", mode: "insertDocuments"}}])

then each shard would send 5 documents to itself and 5 to the other shard: [0, 5) would go from shard 1 to itself, [5, 10) from shard 2 to shard 1, [10, 15) from shard 2 to itself, and [15, 20) from shard 1 to shard 2.
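
For concreteness, the layout above could be set up with something like the following in the mongo shell (a sketch only; the shard names and exact split points are assumptions, and the attached repro may do this differently):

// Shard both collections on {_id: 1}.
sh.enableSharding("test");
sh.shardCollection("test.input", {_id: 1});
sh.shardCollection("test.output", {_id: 1});

// test.input: chunks [0, 5) and [15, 20) on shard 1, chunk [5, 15) on shard 2.
sh.splitAt("test.input", {_id: 5});
sh.splitAt("test.input", {_id: 15});
sh.moveChunk("test.input", {_id: 5}, "shard2");

// test.output: chunk [0, 10) on shard 1, chunk [10, 20) on shard 2.
sh.splitAt("test.output", {_id: 10});
sh.moveChunk("test.output", {_id: 10}, "shard2");

// 20 documents with _ids in [0, 20).
const testDB = db.getSiblingDB("test");
for (let i = 0; i < 20; i++) {
    testDB.input.insert({_id: i});
}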

Now suppose that some of the inserts performed from shard 1 return with an error, maybe a duplicate key. The aggregate on shard 1 will return an error to mongos, and mongos will then send a killCursors to the aggregate (with the $merge) running on shard 2. Upon receiving that killCursors, the $merge stage on shard 2 may have scheduled both of its write batches but not yet heard back from one or both of them. In this scenario, the AsyncRequestsSender will cancel its callbacks as described above, the aggregate and its cursor will be destroyed, and we will respond to mongos and then to the user with the error without verifying that the writes have finished (successfully or not).

In some of our tests, we perform a $merge which is expected to fail after partially completing, then issue a subsequent db.target.remove({}) and expect that remove to get rid of everything in the collection. Because there can be lingering writes from the $merge, the remove may succeed and yet documents can be inserted into the collection shortly afterwards, causing a subsequent assertion to fail.
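
The failing pattern looks roughly like this (a sketch, not the actual test; the collection names and specific error are illustrative):

// The $merge is expected to fail part-way through, e.g. with a duplicate key error.
assert.throws(() => db.input.aggregate([{$merge: {to: "target", mode: "insertDocuments"}}]));

// The test then empties the target collection...
db.target.remove({});

// ...but a zombie insert left over from the failed $merge can land after the remove,
// so this check fails intermittently.
assert.eq(0, db.target.find().itcount());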

I've also attached repro_no_agg.js, which demonstrates that an unordered bulk insert can also leave an active insert in the cluster since it also uses the AsyncRequestsSender, though this is less surprising and I think not a bug: leaving an active operation behind requires the operation to be interrupted, and having an ordinary error like DuplicateKey trigger that interruption is unique to the aggregation system.
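
For illustration, the shape of the bulk insert I mean is something like this (a sketch; I haven't tried to match the attached script exactly):

// Unordered bulk writes from mongos also fan out through the AsyncRequestsSender, so if
// the operation is interrupted (e.g. by killOp or maxTimeMS), control can return to the
// client while a shard is still applying part of the batch.
const bulk = db.target.initializeUnorderedBulkOp();
for (let i = 0; i < 1000; i++) {
    bulk.insert({_id: i});
}
bulk.execute();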



 Comments   
Comment by Charlie Swanson [ 06/Sep/19 ]

david.storch I filed SERVER-43198 to track the work that is actually feasible. I'm not sure what we should do with this ticket, but it does look like all the linked BFs are instances of a remove({}) followed by lingering inserts/upserts rather than a drop and implicit recreate, which is what SERVER-43198 would help with. We haven't seen any of these failures recently, though, so it's worth discussing whether we should mark them trivial, investigate why they stopped happening, or do something else.

Comment by Charlie Swanson [ 05/Sep/19 ]

Charlie to do some JIRA paperwork tomorrow to make it clear that we're interested in fixing zombie writes re-creating collections, but are not pursuing a fix for the more generic problem of leftover work from a cancelled operation.

Comment by Charlie Swanson [ 23/Aug/19 ]

Didn't get as much time to work on this today since I was mostly preoccupied by BF-14452. I did try to run the reproduction script but it no longer works. I'd like to keep looking into that next time.

Comment by Charlie Swanson [ 22/Aug/19 ]

I did get to work on this last Friday. I have a patch that I think will address the problem and hopefully is something we can do for the 4.2 branch as well. It's a little unpolished at this point and I still haven't reproduced this problem so I have no evidence that it actually works, but I've attached what I did so far in zombie_killer.patch.

After just attaching that I see there are reproduction scripts attached to this ticket (which I apparently made??) so I'd like to start there and continue tomorrow.

Comment by Charlie Swanson [ 08/Aug/19 ]

After discussing in-person, we believe there's probably a bug where $out and $merge are setting allowImplicitCreate when they should not be. I'm assigning to myself to investigate.

Comment by Mira Carey [ 26/Jul/19 ]

We were thinking of circumventing maxTimeMs and interrupts.

That's going to be a non-starter. It's enormously problematic to make uninterruptible opCtxs, especially ones waiting for unbounded amounts of time. Even if we want to wait somewhere else farther up the stack (and return different errors to users), you have to unwind your stack (to allow the release of locks that might be needed to make progress on stepDown).

Comment by Jacob Evans [ 26/Jul/19 ]

After discussing this we are now considering adding an option to aggregate (and potentially all operations in the long-term) to cause the client operation itself to wait for all children to die before returning control. This would be very useful to our tests since we'd obviate the need to check currentOp to ensure the operation is really over. We also think it might be useful since it provides these optional guarantees to other clients if they're willing to wait. Let us know what you think.

With that out of the way, let me answer the open questions:

  • The top-level $merge does correspond to the user's command, and we would like to wait.
  • We were thinking of circumventing maxTimeMs and interrupts.
  • Holding on to sockets seems like the simplest implementation.
  • If a connection dies, returning an error through the parent operation seems appropriate and sufficient.
  • The ability to actively kill child ops would be an improvement, but just waiting is all we have planned at the moment. The goal is to at least have the parent operation correctly reflect that it isn't completely done.
  • Although a time frame would be sufficient to make the tests fail less often, we are wary of time-based approaches in testing since they usually lead to sporadic failures.
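
For reference, the currentOp check mentioned above is roughly what a test has to do today (a sketch; the namespace and filter fields are illustrative):

// Poll until no writes against the target collection are still running anywhere in the
// cluster, i.e. until any zombie writes from the failed $merge have drained.
assert.soon(() => {
    const ops = db.getSiblingDB("admin")
                    .aggregate([{$currentOp: {allUsers: true}},
                                {$match: {ns: "test.target", op: {$in: ["insert", "update"]}}}])
                    .toArray();
    return ops.length === 0;
}, "zombie writes are still active");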

Comment by Mira Carey [ 23/Jul/19 ]

A couple of questions:

  • How are you going to wait around for your children? Is it entirely by holding onto sockets with outstanding requests?
    • Would it be a problem if one of those connections had a networking error (but the operation kept running on the other side)?
    • Do you intend to actively kill those ops? Or just wait for them to complete?
  • How important is it that you know that all writes initiated by a command are complete before you return? Would it suffice to have those writes cancelled within a time frame?
    • If so, you might be able to leverage the markKillOnClientDisconnect code to make these kinds of subsidiary operations automatically clean up on cancellation. That would get you eventual, guaranteed cleanup.

Comment by Kaloian Manassiev [ 23/Jul/19 ]

The way I understand it, this is a manifestation of the generic problem of how to cancel a "tree" of distributed operations in the cluster. Your proposal to leave the $merge parent active until all its children have completed seems like the architecturally correct way to achieve this - each parent is responsible for killing and waiting for the clean-up of its children.

A couple of considerations:

  • Does the top-level $merge correspond to the user command against MongoS and is it possible, if the operation context holding it is killed, to give it out to an internal thread to complete the joining of the children instead of doing it on the user thread?
  • Since ClusterWriter is not cancelable cleanly (i.e., if you have a write batch executing, there is no killCursors-equivalent through the cluster writer to clean its state on the remote node, only to stop waiting for it), is your plan to just keep waiting for it to return? Because this will not work if you are doing it on the client's operation context and it gets interrupted or its deadline expires.

Comment by Jacob Evans [ 12/Jul/19 ]

The solution we've settled on after discussion is to have a killed $merge persist in a dead state until all children are complete. Although this doesn't actually allow the command to be swiftly stopped, it leaves the parent $merge around as an indicator that the operation has not yet gone away. At least we would be more accurately reporting the current state of the cluster. This is also something our tests can rely on.

We've decided to solve this in the general case for other bulk operations if possible and if it requires little extra work.

In order to facilitate this we foresee the need to change the ClusterWriter and possibly the AsyncRequestsSender so that parent operations can track the lifetime of their children.

Comment by Charlie Swanson [ 04/Jan/19 ]

I'm leaving this in Needs Triage because I'm not sure whether we should work around this in our tests or add some customization to the AsyncRequestsSender, for agg's purposes, so that it doesn't leave things around when interrupted. As of this writing, I think we should probably fix the server.
