[SERVER-50146] Removing a shard with 'uncommitted' documents in config.rangeDeletions on migration recipient can lead to incomplete state on donor Created: 06/Aug/20  Updated: 26/Oct/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Backlog - Catalog and Routing
Resolution: Unresolved Votes: 0
Labels: oldshardingemea
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-38918 Coordinator should make configOpTime ... Closed
is related to SERVER-50144 Removing a shard with in-progress mig... Backlog
Assigned Teams:
Catalog and Routing
Operating System: ALL
Participants:

 Description   

The following scenario can occur:

  1. Shard X migrates a chunk to shard Y and the migration completes.
  2. At some point before the donor deletes the config.rangeDeletions document on the recipient, shard Y migrates that same chunk to another shard and is then removed.
  3. Shard X receives ShardNotFound for either of the commands it sends to the recipient and never updates its local config.rangeDeletions document. This repeats even after failover, leaving permanent orphans on shard X and making it impossible to migrate an overlapping chunk back to shard X.
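
The scenario above can be illustrated with a toy model. This is only a sketch, not server code: the `Cluster` class, the `finish_migration_recipient_first` function, and the state names `"uncommitted"` and `"updated"` are hypothetical stand-ins, and the recipient-first ordering is inferred from step 3 rather than taken from the MigrationCoordinator source.

```python
class ShardNotFound(Exception):
    """Raised when a command targets a shard that has been removed."""


class Cluster:
    def __init__(self):
        # shard name -> {chunk range -> task state}; a stand-in for each
        # shard's config.rangeDeletions collection (hypothetical model).
        self.range_deletions = {"X": {}, "Y": {}}

    def delete_task_on(self, shard, chunk):
        if shard not in self.range_deletions:
            raise ShardNotFound(shard)
        self.range_deletions[shard].pop(chunk, None)


def finish_migration_recipient_first(cluster, donor, recipient, chunk):
    # Assumed ordering: contact the recipient first, then update the
    # donor's local task. An error on the first line skips the second.
    cluster.delete_task_on(recipient, chunk)
    cluster.range_deletions[donor][chunk] = "updated"


cluster = Cluster()
chunk = ("a", "m")
cluster.range_deletions["X"][chunk] = "uncommitted"
cluster.range_deletions["Y"][chunk] = "uncommitted"

# Step 2: shard Y hands the chunk off elsewhere and is removed.
del cluster.range_deletions["Y"]

# Step 3: the donor's command fails and its local task is never updated.
try:
    finish_migration_recipient_first(cluster, "X", "Y", chunk)
    outcome = "ok"
except ShardNotFound:
    outcome = "ShardNotFound"
```

In this model the donor's task for the chunk stays `"uncommitted"` however many times the function is re-run, which mirrors the "repeats even after failover" behavior described in step 3.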


 Comments   
Comment by Esha Maharishi (Inactive) [ 07/Aug/20 ]

Note that the MigrationCoordinator only retries those two commands a fixed number of times, since they are not wrapped with retryIdempotentWorkAsPrimaryUntilSuccessOrStepdown.

This means the donor can also be left with orphans of its own if it hits repeated network errors in step 3.

As a stop-gap, the donor should (1) update its own range deletion task before the recipient's, and (2) retry those two commands until success or ShardNotFound.

The ShardNotFound issue is analogous to SERVER-38918 and should be handled the same way.
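
The stop-gap proposed in this comment could look roughly like the sketch below. It is illustrative only: `retry_until_success_or_shard_not_found`, `finish_migration`, `NetworkError`, and the callbacks are hypothetical names, not the MigrationCoordinator API, and stepdown handling is left out.

```python
import time


class ShardNotFound(Exception):
    """The targeted shard has been removed from the cluster."""


class NetworkError(Exception):
    """A transient, retryable failure."""


def retry_until_success_or_shard_not_found(command, backoff_secs=0.0):
    """Run `command` until it succeeds; return False if the shard is gone."""
    while True:
        try:
            command()
            return True
        except ShardNotFound:
            # Treated as terminal here, per point (2) of the stop-gap.
            return False
        except NetworkError:
            time.sleep(backoff_secs)


def finish_migration(update_local_task, recipient_commands):
    # (1) Update the donor's own range deletion task first, so a failure
    #     while talking to the recipient cannot leave the local task stale.
    update_local_task()
    # (2) Then retry each recipient command until success or ShardNotFound.
    return [retry_until_success_or_shard_not_found(c) for c in recipient_commands]


# Usage: one command hits two network errors, the other targets a removed shard.
local_task = {"state": "uncommitted"}
failures = iter([NetworkError(), NetworkError()])


def flaky_command():
    err = next(failures, None)
    if err is not None:
        raise err


def removed_shard_command():
    raise ShardNotFound("Y")


results = finish_migration(
    lambda: local_task.update(state="updated"),
    [flaky_command, removed_shard_command],
)
```

With the local update done first, the donor's task is updated even though one recipient command ends in ShardNotFound, and the transient network errors no longer exhaust a fixed retry budget.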
