[SERVER-50146] Removing a shard with 'uncommitted' documents in config.rangeDeletions on migration recipient can lead to incomplete state on donor
| Created: | 06/Aug/20 | Updated: | 26/Oct/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Backlog - Catalog and Routing |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | oldshardingemea | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Catalog and Routing |
| Operating System: | ALL |
| Participants: | |
| Description |
|
The following scenario can occur:
|
| Comments |
| Comment by Esha Maharishi (Inactive) [ 07/Aug/20 ] |
|
Note that the MigrationCoordinator only retries those two commands a fixed number of times, since they are not wrapped with retryIdempotentWorkAsPrimaryUntilSuccessOrStepdown. This means the donor can also be left with orphans on itself if it gets repeated network errors in step 3. As a stop-gap, the donor should (1) update its own range deletion task before the recipient's, and (2) retry those two commands until success or ShardNotFound. The ShardNotFound issue is analogous to |
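
The stop-gap proposed in the comment above (update the donor's own task first, then retry the remote update until success or ShardNotFound) can be sketched roughly as follows. This is a minimal illustration only, not the server's implementation: `Outcome`, `retryUntilSuccessOrShardNotFound`, `TaskUpdateResult`, and `updateRangeDeletionTasks` are hypothetical names invented for the sketch, and the two callbacks stand in for whatever commands the MigrationCoordinator actually sends. A real version would presumably also stop on stepdown, as retryIdempotentWorkAsPrimaryUntilSuccessOrStepdown does, and would back off between attempts.

```cpp
#include <functional>

// Hypothetical outcome codes for this sketch; not MongoDB's ErrorCodes enum.
enum class Outcome { Ok, NetworkError, ShardNotFound };

// Re-run `op` for as long as it reports a transient network error.
// Ok and ShardNotFound are both treated as terminal and returned as-is.
// Assumption: `op` is idempotent, so repeating it is safe.
Outcome retryUntilSuccessOrShardNotFound(const std::function<Outcome()>& op) {
    for (;;) {
        const Outcome result = op();
        if (result != Outcome::NetworkError) {
            return result;
        }
        // A real implementation would sleep/back off and check for stepdown here.
    }
}

// Hypothetical result pair: final outcome of each update.
struct TaskUpdateResult {
    Outcome donor;
    Outcome recipient;
};

// Ordering from the comment: the donor updates its own range deletion task
// first, so it is not left inconsistent if the recipient can no longer be
// reached, and only then updates the recipient's task.
TaskUpdateResult updateRangeDeletionTasks(
    const std::function<Outcome()>& updateDonorTask,
    const std::function<Outcome()>& updateRecipientTask) {
    TaskUpdateResult result{};
    result.donor = retryUntilSuccessOrShardNotFound(updateDonorTask);
    // ShardNotFound from the recipient means the shard was removed, so there
    // is nothing left to update there; the caller can treat it as done.
    result.recipient = retryUntilSuccessOrShardNotFound(updateRecipientTask);
    return result;
}
```

With this shape, repeated network errors no longer exhaust a fixed retry budget, and a removed recipient surfaces as an explicit `ShardNotFound` result instead of leaving the donor's task untouched.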