-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: 7.3.3, 7.0.12, 8.0.0-rc10
-
Component/s: None
-
None
-
Catalog and Routing
-
Fully Compatible
-
ALL
-
v8.0, v7.3, v7.0
-
CAR Team 2024-07-08, CAR Team 2024-07-22
When a chunk migration happens, on the recipient side we wait for ongoing range deletions on overlapping ranges before persisting a range deletion document.
But on the donor side we assume that no range deletion document exists locally for the range being moved.
That's a wrong assumption because the following could happen:
1. A migration from shardA to shardB commits on the CSRS but is interrupted right before the pending range deletion task on the recipient shard is deleted. shardA will need to recover the migration and fix up the range deletion document.
2. shardB immediately migrates the newly received chunk to shardC, flagging a range deletion task matching the moved range as ready (relying on collection uuid + boundaries). This can end up flagging as ready the pending range deletion task left over from the migration in (1).
3. shardA recovers the migration and deletes the range deletion task on shardB (relying on migration id + collection id + boundaries). This is a no-op because of (2).
As a result, the range deletion task for the migration that happened at (2) will stay flagged as pending forever.
The consequence is that no range overlapping the pending range deletion task can ever be moved back. In the worst case, this may leave the balancer unable to migrate chunks from shardC to shardB because of the chunk selection policy (always picking the lowest chunk from the donor shard).
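The interleaving described above can be illustrated with a small in-memory simulation. This is only a sketch, not the server code: `RangeDeletionTask`, `mark_ready_by_range`, `delete_pending_by_migration` and `simulate` are hypothetical names, and two behaviours are assumptions made so that the model reproduces the reported outcome: the "mark as ready" step updates only the first document matching collection uuid + boundaries, and the recovery delete only matches a document that is still flagged as pending.

```python
from dataclasses import dataclass


@dataclass
class RangeDeletionTask:
    # Hypothetical, simplified stand-in for a range deletion document.
    migration_id: str
    collection_uuid: str
    min_key: int
    max_key: int
    pending: bool = True


def mark_ready_by_range(tasks, collection_uuid, min_key, max_key):
    """Flag as ready the first task matching collection uuid + boundaries.

    ASSUMPTION: only the first matching document is updated.
    """
    for task in tasks:
        if (task.collection_uuid, task.min_key, task.max_key) == (
            collection_uuid, min_key, max_key
        ):
            task.pending = False
            return task
    return None


def delete_pending_by_migration(tasks, migration_id, collection_uuid, min_key, max_key):
    """Delete the task matching migration id + collection uuid + boundaries.

    ASSUMPTION: the delete only matches tasks still flagged as pending.
    Returns the number of deleted tasks.
    """
    before = len(tasks)
    tasks[:] = [
        t for t in tasks
        if not (
            t.migration_id == migration_id
            and t.collection_uuid == collection_uuid
            and (t.min_key, t.max_key) == (min_key, max_key)
            and t.pending
        )
    ]
    return before - len(tasks)


def simulate():
    shard_b_tasks = []

    # (1) shardA -> shardB commits, but the recipient's pending task is left behind.
    shard_b_tasks.append(RangeDeletionTask("migration-1", "coll-uuid", 0, 10))

    # (2) shardB -> shardC: shardB, now the donor, persists its own pending task
    #     for the same range, then flags "its" task as ready by uuid + boundaries.
    shard_b_tasks.append(RangeDeletionTask("migration-2", "coll-uuid", 0, 10))
    flagged = mark_ready_by_range(shard_b_tasks, "coll-uuid", 0, 10)

    # (3) shardA recovers migration-1 and tries to delete the leftover task on shardB.
    deleted = delete_pending_by_migration(shard_b_tasks, "migration-1", "coll-uuid", 0, 10)

    return flagged.migration_id, deleted, [(t.migration_id, t.pending) for t in shard_b_tasks]
```

Under these assumptions, `simulate()` flags the task from migration (1) as ready instead of the one from migration (2), the recovery delete removes nothing, and the task created by migration (2) is left pending with nothing remaining that would ever flip it.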
- is caused by
-
SERVER-69586 Make update/delete of range deletion document on recipient idempotent
- Closed
- is related to
-
SERVER-92381 Ensure MigrationSourceManager fulfills its promise when aborting in early stages
- Closed