Observed in mr_output_options.js where coll.remove({}) completed successfully, but then coll.find().itcount() != 0 (with no concurrent inserts). This is in contrast to running coll.remove({}) on a standalone or replica set, or on a sharded cluster in the absence of a concurrent chunk migration.
Sequence of events is:
- Start chunk migration from shardA to shardB.
- After the range deletion on the recipient (shardB), but before the clone starts, the mongos gets coll.remove({}), and broadcasts it unversioned to both shards.
- shardB finishes that deletion quickly. shardB now has 0 documents in coll.
- Meanwhile, shardA has started processing the multi-delete, but is working on other documents, not those in the chunk range being moved.
- Now the clone of documents from shardA to shardB happens (starts and completes). shardB now has non-zero documents (the contents of the chunk being moved).
- The migration enters the critical section to commit, interrupting the multi-delete on shardA with StaleConfig "migration commit in progress for dbname.collname".
- The migration gets the final xfermods from the donor's OpObsever inside the critical section, but because the multi-delete on shardA hasn't yet gotten to any of the chunk range documents, there are no mods to apply. The migration finishes normally.
- In the meantime, the mongos received StaleConfig from the multi-delete on shardA, so it has resent the multi-delete but only to shardA. It blocks until the critical section exits, then runs normally to successful completion. The mongos multi-delete command now also completes successfully. shardA now has 0 documents, but shardB still has the documents from the migrated chunk.
- related to
-
SERVER-47371 Chunk migration concurrent with multi-delete can cause matching documents to not be deleted
- Backlog