- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Aggregation Framework, Write Ops
- Query Execution
- ALL
If a $merge fails on one shard, the process of cleaning up another shard's writes can leave some writes cancelled but still very much alive in the network. Upon any error, all other shards participating in an aggregation will be interrupted. Because an interrupted call to AsyncRequestsSender::next() cancels its callbacks and returns without waiting for the responses to inserts or updates we have already sent, we can end up returning control to the user while there are still active writes which were generated on their behalf. This is causing problems in some of our tests, but it is generally surprising and probably undesirable.
In more detail, consider a sharded cluster which has collections test.input and test.output, both sharded by {_id: 1}. Suppose that test.input is distributed like so:
- chunks [0, 5) and [15, 20) on shard 1
- chunk [5, 15) on shard 2
and suppose test.output is distributed like so:
- chunk [0, 10) on shard 1
- chunk [10, 20) on shard 2
If we had 20 documents in test.input with _ids in [0, 20) and we executed
db.input.aggregate([{$merge: {to: "output", mode: "insertDocuments"}}])
then each shard would end up sending 5 documents to itself and 5 to the other shard: [0, 5) would go from shard 1 to itself, [5, 10) would go from shard 2 to shard 1, [10, 15) would go from shard 2 to itself, and [15, 20) would go from shard 1 to shard 2.
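A minimal shell sketch of this setup, assuming a two-shard cluster with shards named "shard1" and "shard2" (the shard names and the splitAt/moveChunk layout below are illustrative assumptions; the $merge command is the one above):

sh.enableSharding("test");

// Lay out test.input as chunks [0, 5) and [15, 20) on shard 1 and [5, 15) on shard 2
// (the outermost chunks really extend to MinKey/MaxKey, but only [0, 20) matters here).
sh.shardCollection("test.input", {_id: 1});
sh.splitAt("test.input", {_id: 5});
sh.splitAt("test.input", {_id: 15});
sh.moveChunk("test.input", {_id: 0}, "shard1");
sh.moveChunk("test.input", {_id: 5}, "shard2");
sh.moveChunk("test.input", {_id: 15}, "shard1");

// Lay out test.output as chunk [0, 10) on shard 1 and [10, 20) on shard 2.
sh.shardCollection("test.output", {_id: 1});
sh.splitAt("test.output", {_id: 10});
sh.moveChunk("test.output", {_id: 0}, "shard1");
sh.moveChunk("test.output", {_id: 10}, "shard2");

// Seed test.input with _ids [0, 20) and run the $merge described above.
for (let i = 0; i < 20; i++) {
    db.input.insert({_id: i});
}
db.input.aggregate([{$merge: {to: "output", mode: "insertDocuments"}}]);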
Then suppose that when performing the inserts from shard 1, some of them return with an error, perhaps a duplicate key. The aggregate on shard 1 will return that error to mongos. When the error gets back to mongos, mongos will send a killCursors to the aggregate (with the $merge) running on shard 2. Upon receiving that killCursors, the $merge stage on shard 2 may be in a state where it has scheduled both writes but hasn't heard back from one or both of them. In this scenario, the AsyncRequestsSender will cancel its callbacks as described above, the aggregate and its cursor will be destroyed, and we will respond to mongos and then to the user with the error without verifying that the writes have finished (successfully or not).
In some of our tests, we perform a $merge which is expected to fail after partially completing, then issue a subsequent db.target.remove({}) and expect that remove to get rid of everything in the collection. Because there can be lingering writes from the $merge, the remove may succeed but documents can be inserted into the collection very shortly afterwards, causing a subsequent assertion to fail.
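A hedged sketch of that test pattern using the example collections above (the colliding _id and the one-second wait are illustrative assumptions, not taken from an actual test):

// Seed a document that will collide with one of the $merge inserts so the
// stage partially completes and then fails with a duplicate key error.
db.output.insert({_id: 3, conflicting: true});
assert.throws(() => db.input.aggregate([{$merge: {to: "output", mode: "insertDocuments"}}]));

// Clean up the target collection and expect it to stay empty...
db.output.remove({});
assert.eq(0, db.output.find().itcount());

// ...but a write scheduled by the failed $merge may still be in flight on a
// shard, so a later check can find documents have reappeared.
sleep(1000);
assert.eq(0, db.output.find().itcount());  // can fail intermittently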
I've also attached repro_no_agg.js, which demonstrates that an unordered bulk insert can also leave an active insert in the cluster, since it also uses the AsyncRequestsSender, though this is less surprising and I think not a bug. In that case it is not possible to leave an active operation behind unless the operation itself is interrupted; triggering an interruption from an ordinary error like DuplicateKey is unique to the aggregation system.
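For reference, a hedged illustration of that non-aggregation case (this is not the attached repro_no_agg.js, whose contents are not reproduced here; using a deliberately tiny maxTimeMS to interrupt the mongos-side operation is an assumption for illustration):

// An unordered insert targeting both shards of test.output. If the mongos-side
// operation is interrupted (here via maxTimeMS) while the per-shard batches are
// still outstanding, the error can reach the client while those batches remain
// active on the shards.
const docs = [];
for (let i = 0; i < 20; i++) {
    docs.push({_id: i});
}
const res = db.runCommand({
    insert: "output",
    documents: docs,
    ordered: false,
    maxTimeMS: 1
});
printjson(res);  // may report MaxTimeMSExpired while shard-side inserts continue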
- is related to:
  - SERVER-80853 $out on secondary node can produce incorrect results if primary steps down (Backlog)
  - SERVER-43198 Zombie writes from failing $merge should not be able to re-create a collection (Backlog)
  - SERVER-43851 Work around zombie writes in $merge tests (Closed)
- related to:
  - SERVER-88154 Interrupting $out on sharded cluster can leave temp collections (Closed)