- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 6.0.6, 6.3.1, 5.0.18, 7.0.0-rc2
- Component/s: Sharding
- Sharding EMEA
- Fully Compatible
- ALL
- v7.1, v7.0, v6.0, v5.0
- Sharding EMEA 2023-06-26, Sharding EMEA 2023-07-10, Sharding EMEA 2023-07-24, Sharding EMEA 2023-08-07, Sharding EMEA 2023-08-21, Sharding EMEA 2023-09-04, Sharding EMEA 2023-09-18, Sharding EMEA 2023-10-02, Sharding EMEA 2023-10-16
When stopping migrations on a sharded collection that is being renamed, the flow triggers a refresh on every shard so that each shard discovers the stopMigrations flag and aborts any ongoing migration before returning.
However, if the donor steps down right at the end of a refresh, the refresh may succeed even though the abort has failed: the wait for the abort never throws, because the migration source manager does not invalidate the future on error. As a result, the refresh spawned by stopMigration succeeds and the coordinator can proceed to the next phase before the abort completes. The abort completes by locally deleting the range deletion document (on the donor) and by flagging the range deletion task as ready on the recipient side.
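The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the server's actual code: `StepDownError`, `abort_migration`, `refresh_and_wait` and the `propagate_error` switch are hypothetical names, and Python's `concurrent.futures.Future` stands in for the future that the refresh waits on. It shows how a waiter observes success when the error path resolves the future without recording the error.

```python
from concurrent.futures import Future


class StepDownError(Exception):
    """Hypothetical stand-in for an interruption caused by a donor step-down."""


def abort_migration(done: Future, step_down: bool, propagate_error: bool) -> None:
    """Toy abort routine that resolves `done` when it finishes."""
    try:
        if step_down:
            raise StepDownError("donor stepped down during abort")
        done.set_result("aborted")
    except StepDownError as err:
        if propagate_error:
            # The waiter will observe the failure.
            done.set_exception(err)
        else:
            # The failure is swallowed: the waiter sees a normal completion.
            done.set_result(None)


def refresh_and_wait(step_down: bool, propagate_error: bool) -> str:
    """Toy refresh: triggers the abort, then waits on its future."""
    done: Future = Future()
    abort_migration(done, step_down, propagate_error)
    try:
        done.result(timeout=1)
    except StepDownError:
        return "refresh failed"
    return "refresh succeeded"
```

In this model, `refresh_and_wait(step_down=True, propagate_error=False)` reports success even though the abort failed, which mirrors the situation in which the coordinator moves on too early. With `propagate_error=True` the same step-down surfaces as a failed refresh.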
This is problematic because:
1. When snapshotting range deletions, a rename participant may copy a document flagged as pending right before the migration deletes it (on the donor) or unflags it (on the recipient).
2. The pending task is then restored in the next phase.
The result is that:
- On the donor side: if the range deletion document is deleted between steps 1 and 2, the document restored in step 2 remains marked as "pending" forever.
- On the recipient side: if the range deletion happens to be executed between steps 1 and 2, the document restored in step 2 remains marked as "pending" forever.
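The interleaving above can be reproduced with a toy model. This is a simplified sketch, not the server's implementation: the range deletion collection is modeled as a plain dictionary, `snapshot`, `restore` and `simulate` are hypothetical helpers, and the recipient case is modeled as the migration unflagging the document.

```python
import copy


def snapshot(range_deletions: dict) -> dict:
    """Step 1: the rename participant copies the range deletion documents."""
    return copy.deepcopy(range_deletions)


def restore(range_deletions: dict, snap: dict) -> None:
    """Step 2: the rename participant writes the snapshotted documents back."""
    range_deletions.update(copy.deepcopy(snap))


def simulate(side: str) -> dict:
    """Run the racy interleaving for 'donor' or 'recipient'."""
    store = {"range-1": {"pending": True}}

    snap = snapshot(store)  # step 1: copies a document still flagged as pending

    # The late migration abort lands between step 1 and step 2.
    if side == "donor":
        del store["range-1"]             # donor deletes the document
    else:
        store["range-1"].pop("pending")  # recipient unflags the document

    restore(store, snap)  # step 2: the stale pending copy comes back
    return store
```

Under these assumptions, both `simulate("donor")` and `simulate("recipient")` end with `range-1` still carrying `pending: True`, and nothing in the model will clear the flag afterwards.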
At the time of writing, this ticket affects all versions that support sharded rename, that is, all versions >= v5.0.0.