[SERVER-46211] Chunk migration concurrent with multi-delete can cause matching documents to not be deleted
Created: 17/Feb/20  Updated: 29/Oct/23  Resolved: 06/Apr/20

Status: Closed
Project: Core Server
Component/s: Querying, Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Kevin Pulo
Assignee: Randolph Tan
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-47371 Chunk migration concurrent with multi... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Participants:
Linked BF Score: 20

 Description   

Observed in mr_output_options.js, where coll.remove({}) completed successfully but coll.find().itcount() was then non-zero, even though there were no concurrent inserts. By contrast, coll.remove({}) leaves the collection empty when run on a standalone, on a replica set, or on a sharded cluster with no concurrent chunk migration.

The sequence of events is:

  1. Start chunk migration from shardA to shardB.

  2. After the range deletion on the recipient (shardB), but before the clone starts, the mongos gets coll.remove({}), and broadcasts it unversioned to both shards.

  3. shardB finishes that deletion quickly. shardB now has 0 documents in coll.

  4. Meanwhile, shardA has started processing the multi-delete, but is working on other documents, not those in the chunk range being moved.

  5. Now the clone of documents from shardA to shardB happens (starts and completes). shardB now has non-zero documents (the contents of the chunk being moved).

  6. The migration enters the critical section to commit, interrupting the multi-delete on shardA with StaleConfig "migration commit in progress for dbname.collname".

  7. The migration gets the final xfermods from the donor's OpObserver inside the critical section, but because the multi-delete on shardA hasn't yet reached any of the documents in the chunk range, there are no mods to apply. The migration finishes normally.

  8. In the meantime, the mongos has received StaleConfig from the multi-delete on shardA, so it resends the multi-delete, but only to shardA. The resent delete blocks until the critical section exits, then runs normally to successful completion. The mongos multi-delete command now also completes successfully. shardA now has 0 documents, but shardB still has the documents from the migrated chunk.
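
The interleaving above can be replayed with a toy model. This is only a sketch: it does not use any MongoDB API, the shards are plain in-memory arrays, and the names simulateRace, inMovedChunk, deleteOnDonor, the field x, and the chunk range x >= 10 are all hypothetical choices made for illustration. It shows only why an empty xfermods list plus a retry targeted at shardA alone leaves documents behind on shardB.

```javascript
// Toy replay of steps 1-8. Not MongoDB code: shards are arrays of documents,
// and the moved chunk is assumed (for illustration) to cover x >= 10.
function simulateRace() {
  const inMovedChunk = (doc) => doc.x >= 10;

  let shardA = [{ x: 1 }, { x: 2 }, { x: 10 }, { x: 11 }]; // donor
  let shardB = [{ x: -5 }, { x: 12 }];                     // recipient; x: 12 is a stale orphan
  const xferMods = []; // deletes on the donor that touch the moving chunk

  // A delete on the donor is recorded as a mod only if it hits the chunk range.
  const deleteOnDonor = (doc) => {
    shardA = shardA.filter((d) => d !== doc);
    if (inMovedChunk(doc)) xferMods.push({ op: "d", x: doc.x });
  };

  // Step 1: migration starts; the recipient clears the incoming range.
  shardB = shardB.filter((doc) => !inMovedChunk(doc));

  // Steps 2-3: remove({}) is broadcast; shardB finishes it immediately.
  shardB = [];

  // Step 4: shardA has so far only deleted documents outside the chunk range.
  shardA.filter((doc) => !inMovedChunk(doc)).forEach(deleteOnDonor);

  // Step 5: the clone copies the chunk's documents from shardA to shardB.
  shardB.push(...shardA.filter(inMovedChunk).map((doc) => ({ ...doc })));

  // Step 6: the critical section interrupts the delete on shardA (StaleConfig).
  // Step 7: the final xfermods are applied on the recipient; the list is empty.
  for (const mod of xferMods) {
    shardB = shardB.filter((doc) => doc.x !== mod.x);
  }

  // Step 8: the retry goes only to shardA, which deletes everything it holds.
  shardA = [];

  return { shardA, shardB, xferMods };
}

const outcome = simulateRace();
console.log(outcome.shardA.length, outcome.shardB.length); // prints "0 2"
```

In this model, the two documents left on shardB are the ones cloned in step 5. Nothing removes them afterwards, because the delete on shardB had already finished in step 3 and the retry in step 8 never targets shardB.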


 Comments   
Comment by Max Hirschhorn [ 06/Apr/20 ]

SERVER-47371 is where we'll continue to track this issue. Resolving this ticket and its 4.4 backport to track only the test changes made to mr_output_options.js.

Comment by Githook User [ 25/Mar/20 ]

Author:

{'name': 'Randolph Tan', 'username': 'renctan', 'email': 'randolph@10gen.com'}

Message: SERVER-46211 Disable balancer on mr_output_options.js

(cherry picked from commit a6350e13bb18eab3b00624d8c8d82e550a382906)
Branch: v4.4
https://github.com/mongodb/mongo/commit/832835b16c0ad610812d7a7c22df5ce3dc92da5a

Comment by Githook User [ 17/Mar/20 ]

Author:

{'email': 'randolph@10gen.com', 'name': 'Randolph Tan', 'username': 'renctan'}

Message: SERVER-46211 Disable balancer on mr_output_options.js
Branch: master
https://github.com/mongodb/mongo/commit/a6350e13bb18eab3b00624d8c8d82e550a382906
