[SERVER-48198] Migration recovery may recover incorrect decision after shard key refine Created: 13/May/20  Updated: 29/Oct/23  Resolved: 14/May/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc7, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-46386 Refining a shard key may lead to an o... Closed
Duplicate
is duplicated by SERVER-48209 Migration coordinator step-up recover... Closed
Related
related to SERVER-48242 Make stepping up secondary in range_d... Closed
related to SERVER-48246 Wait for RSM to detect failover in ra... Closed
is related to SERVER-45983 Perform the shardVersion recovery and... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-05-18
Participants:
Linked BF Score: 28

 Description   

When a new primary steps up in a shardsvr replica set, it launches a task to recover any migrations driven by the node's shard that were in progress when the previous primary stepped down. As part of this, the recovery process recovers the outcome of each migration by loading the latest metadata from the config server and checking whether the minimum bound from the migration still belongs to the donor shard. If it does, the migration is assumed to have aborted, and the recovery process updates the persisted range deleter state on the donor and recipient shards so any orphans on either are deleted.
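The ownership check described above can be sketched roughly as follows. This is a simplified model, not the server's implementation: the routing table is a sorted list of (chunk minimum, owning shard) pairs, keys are one-element tuples, negative infinity stands in for MinKey, and the names `owner_of` and `recover_decision` are hypothetical.

```python
from bisect import bisect_right

NEG_INF = float("-inf")  # stand-in for the lowest possible shard key value


def owner_of(routing_table, key):
    """Return the shard owning the chunk that contains `key`.

    `routing_table` is a list of (chunk_min, shard) pairs sorted by
    chunk_min; each chunk ends where the next one begins.
    """
    mins = [chunk_min for chunk_min, _ in routing_table]
    return routing_table[bisect_right(mins, key) - 1][1]


def recover_decision(routing_table, migration_min, donor):
    """If the donor still owns the migration's min bound, assume it aborted."""
    if owner_of(routing_table, migration_min) == donor:
        return "aborted"
    return "committed"


# Hypothetical tables for a migration of the range [0, 10) away from "donor".
table_after_commit = [((NEG_INF,), "donor"), ((0,), "recipient"), ((10,), "donor")]
table_after_abort = [((NEG_INF,), "donor")]

print(recover_decision(table_after_commit, (0,), "donor"))  # committed
print(recover_decision(table_after_abort, (0,), "donor"))   # aborted
```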

If an interrupted migration committed successfully and its namespace had its shard key refined before the recovery process runs, the ownership check compares the pre-refine minimum bound against a post-refine routing table. This may result in a spurious overlap, which leads the recovery process to incorrectly decide the migration aborted, preventing any orphans on the donor from being cleaned up. The recipient will attempt to schedule a range deletion for the received range, which will fail with RangeOverlapConflict.
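The spurious overlap can be reproduced in a small model. The sketch below assumes that refining a shard key from {x: 1} to {x: 1, y: 1} pads existing chunk bounds with MinKey for the new field, and that a shorter key sorts before a longer key sharing the same prefix. The `MinKey` class and `owner_of` helper are illustrative stand-ins, not server code.

```python
from bisect import bisect_right
from functools import total_ordering


@total_ordering
class MinKey:
    """Illustrative stand-in for BSON MinKey: sorts before every other value."""

    def __eq__(self, other):
        return isinstance(other, MinKey)

    def __lt__(self, other):
        return not isinstance(other, MinKey)


MINKEY = MinKey()


def owner_of(routing_table, key):
    """Return the shard whose chunk contains `key` (table sorted by chunk min)."""
    mins = [chunk_min for chunk_min, _ in routing_table]
    return routing_table[bisect_right(mins, key) - 1][1]


# Post-refine routing table. The migration of the pre-refine range
# [x: 0, x: 10) has committed, so the recipient owns that chunk.
post_refine_table = [
    ((MINKEY, MINKEY), "donor"),
    ((0, MINKEY), "recipient"),
    ((10, MINKEY), "donor"),
]

# The persisted migration state still holds the pre-refine min bound {x: 0}.
pre_refine_min = (0,)

# (0,) sorts before (0, MINKEY), so the lookup lands in the preceding chunk,
# which the donor owns, and the migration looks aborted.
print(owner_of(post_refine_table, pre_refine_min))  # donor
```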

To fix this, the recovery process should extend the migration's min bound when performing the ownership check if the most recent shard key has more fields, as was done in SERVER-46386.
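A minimal sketch of the bound extension, assuming the missing trailing fields are filled with MinKey so the extended bound matches how chunk bounds look after a refine. The function name `extend_bound` and the `MIN_KEY` placeholder are hypothetical, and bounds are modeled as plain dicts rather than BSON.

```python
MIN_KEY = "MinKey"  # placeholder for BSON MinKey, for illustration only


def extend_bound(bound, shard_key_fields):
    """Pad `bound` with MIN_KEY for each shard key field it does not have.

    `bound` maps field names to values in shard key order;
    `shard_key_fields` lists the fields of the current shard key.
    """
    # A refined shard key keeps the old key as a prefix, so the old bound's
    # fields must match the leading fields of the current key.
    if list(bound) != shard_key_fields[: len(bound)]:
        raise ValueError("bound is not a prefix of the current shard key")

    extended = dict(bound)
    for field in shard_key_fields[len(bound):]:
        extended[field] = MIN_KEY
    return extended


# The pre-refine min bound {x: 0} under the refined key {x: 1, y: 1}.
print(extend_bound({"x": 0}, ["x", "y"]))  # {'x': 0, 'y': 'MinKey'}
```

A bound that already has every field of the current shard key is returned unchanged, so the same call works whether or not a refine happened.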



 Comments   
Comment by Githook User [ 15/May/20 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-48198 Account for extended range bounds when recovering migration decision

(cherry picked from commit 9d8eb69d583b89682520ec58595e558d5f6cc9a2)
Branch: v4.4
https://github.com/mongodb/mongo/commit/7d4d1ebeaeee37d743ad65099702bf27a12d7d33

Comment by Githook User [ 14/May/20 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-48198 Account for extended range bounds when recovering migration decision
Branch: master
https://github.com/mongodb/mongo/commit/9d8eb69d583b89682520ec58595e558d5f6cc9a2

Generated at Thu Feb 08 05:16:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.