Core Server / SERVER-38852

Failing $merge can leave zombie writes in the cluster

    • Query Execution
    • ALL

      I first reproduced the error described above using the attached repro_agg_duplicate_key.js. I don't know why, but running this reproduction with

      python buildscripts/resmoke.py --suites=sharding_continuous_config_stepdown repro_agg_duplicate_key.js
      

      appears to reproduce the failure more frequently than running with just --suites=sharding, though it's possible I'm imagining that.


      If a $merge fails on one shard, the process of cleaning up the other shards' writes can leave some writes cancelled, but still very much alive in the cluster. Upon any error, all other shards participating in the aggregation will be interrupted. Because an interrupted call to AsyncRequestsSender::next() will cancel its callbacks and return without waiting for responses to inserts or updates we've already sent, we can end up returning control to the user while there are still active writes which were generated on their behalf. This is causing problems in some of our tests, but is generally surprising and probably undesirable.

      In more detail, consider a sharded cluster which has collections test.input and test.output, both sharded by {_id: 1}. Suppose that test.input is distributed like so (a shell sketch of this setup follows the two lists below):

      • chunks [0, 5) and [15, 20) on shard 1
      • chunk [5, 15) on shard 2

      and suppose test.output is distributed like so:

      • chunk [0, 10) on shard 1
      • chunk [10, 20) on shard 2
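
      For concreteness, here is a minimal shell sketch of creating that distribution, assuming a mongos connection and two shards named "shard1" and "shard2" (the shard names are illustrative, and the MinKey/MaxKey boundary chunks are elided from the ranges above):

      sh.enableSharding("test");
      sh.shardCollection("test.input", {_id: 1});
      sh.shardCollection("test.output", {_id: 1});

      // Assumes the database's primary shard is shard 1, so all chunks start there.
      // test.input: [0, 5) and [15, 20) stay on shard 1; [5, 15) moves to shard 2.
      sh.splitAt("test.input", {_id: 5});
      sh.splitAt("test.input", {_id: 15});
      sh.moveChunk("test.input", {_id: 5}, "shard2");

      // test.output: [0, 10) stays on shard 1; [10, 20) moves to shard 2.
      sh.splitAt("test.output", {_id: 10});
      sh.moveChunk("test.output", {_id: 10}, "shard2");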

      If we had 20 documents in test.input with _ids in [0, 20) and we executed

      db.input.aggregate([{$merge: {to: "output", mode: "insertDocuments"}}])
      

      then each shard would end up sending 5 documents to itself and 5 to the other shard: [0, 5) would go from shard 1 to itself, [5, 10) would go from shard 2 to shard 1, [10, 15) would go from shard 2 to itself, and [15, 20) would go from shard 1 to shard 2.

      Then suppose that when shard 1 performs its inserts, some of them return with an error, perhaps a duplicate key. The aggregate on shard 1 will return with an error to mongos. When that error gets back to mongos, mongos will send a killCursors to the aggregate (with the $merge) running on shard 2. Upon receiving that killCursors, the $merge stage may be in a state where it has scheduled both writes but hasn't heard back from one or both of them. In this scenario, the AsyncRequestsSender will cancel its callbacks as described above, the aggregate and its cursor will be destroyed, and we will respond to mongos, and then to the user, with the error, without verifying that the writes have finished (successfully or not).
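
      To see this from the shell, one can look for operations still running against the output namespace immediately after the error comes back. This is only a sketch and is timing-dependent, so any given run may show nothing:

      // Expect the $merge to fail, e.g. with a DuplicateKey error.
      assert.throws(() => db.input.aggregate(
          [{$merge: {to: "output", mode: "insertDocuments"}}]));

      // Writes that outlived the failed aggregate may still show up here.
      const lingering = db.getSiblingDB("admin")
                            .aggregate([{$currentOp: {allUsers: true}},
                                        {$match: {ns: "test.output"}}])
                            .toArray();
      printjson(lingering);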

      In some of our tests, we perform a $merge which is expected to fail after partially completing, then issue a subsequent db.target.remove({}) and expect that remove to get rid of everything in the collection. Because there can be lingering writes from the $merge, the remove may succeed and yet, very shortly afterward, some documents will be inserted into the collection, causing a subsequent assertion to fail.
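
      As a sketch, that fragile pattern looks roughly like this (collection names are illustrative, not the literal contents of the tests):

      // The $merge is expected to fail part-way through.
      assert.throws(() => db.input.aggregate(
          [{$merge: {to: "target", mode: "insertDocuments"}}]));

      // Clean up, then assert the collection is empty.
      assert.writeOK(db.target.remove({}));

      // Racy: a zombie insert from the failed $merge can land after the
      // remove, so this assertion occasionally fails.
      assert.eq(0, db.target.count());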

      I've also attached repro_no_agg.js, which demonstrates that an unordered bulk insert can also leave an active insert in the cluster, since it also uses the AsyncRequestsSender, though this is less surprising and I think not a bug. A zombie write can only be left behind when the operation is interrupted, and triggering that interruption internally on an error like DuplicateKey is unique to the aggregation system.
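
      For reference, here is a minimal sketch of that non-aggregation variant, assuming the interruption comes from an external killOp; this is a hypothetical illustration, not necessarily how the attached repro_no_agg.js is written:

      // Run a large unordered bulk insert in the background.
      const awaitShell = startParallelShell(function() {
          const bulk = db.getSiblingDB("test").target.initializeUnorderedBulkOp();
          for (let i = 0; i < 100000; i++) {
              bulk.insert({_id: i});
          }
          try {
              bulk.execute();
          } catch (e) {
              print("bulk insert interrupted: " + e);
          }
      });

      // Wait for the insert to appear in mongos's currentOp, then interrupt it.
      // The filter is illustrative; a real repro would match more carefully.
      let opid;
      assert.soon(() => {
          const ops = db.getSiblingDB("admin")
                          .aggregate([{$currentOp: {localOps: true}},
                                      {$match: {ns: "test.target"}}])
                          .toArray();
          if (ops.length === 0)
              return false;
          opid = ops[0].opid;
          return true;
      });
      db.killOp(opid);

      // mongos may return to the shell here while some shard-side writes are
      // still in flight.
      awaitShell();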

        Attachments:
          1. repro_agg_duplicate_key.js (2 kB)
          2. repro_no_agg.js (2 kB)
          3. zombie_killer.patch (15 kB)

            Assignee:
            Backlog - Query Execution (backlog-query-execution)
            Reporter:
            Charlie Swanson (charlie.swanson@mongodb.com)
            Votes:
            0
            Watchers:
            12
