[SERVER-55557] Range deletion of aborted migration can fail after a refine shard key Created: 26/Mar/21  Updated: 29/Oct/23  Resolved: 28/May/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
duplicates SERVER-52906 moveChunk after failed migration that... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.0
Sprint: Sharding EMEA 2021-05-31
Participants:
Linked BF Score: 27

 Description   

At the end of _configsvrRefineCollectionShardKey, the config server triggers a best-effort, fire-and-forget refresh on the shards that own chunks. Because it is best-effort, the shards are not guaranteed to actually refresh.

Consider a shard that had cached metadata for the collection but did not successfully refresh after the refineCollectionShardKey. If this shard is later the recipient of a chunk migration that gets aborted, then when it executes the range deletion it will believe the collection still has the old shard key. However, the range boundaries in the task use the new, refined shard key, so the call to KeyPattern::extendRangeBound fails.
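
The mismatch can be illustrated with a small Python model. This is only a sketch of the idea, not the server's C++ code: `extend_range_bound`, `MIN_KEY`, and `MAX_KEY` are hypothetical stand-ins, and the model assumes that a bound may have fewer fields than the key pattern (they get padded) but never more.

```python
# Toy model of extending a range bound to a full shard key pattern.
# All names are illustrative stand-ins, not MongoDB APIs.
MIN_KEY = "MinKey"
MAX_KEY = "MaxKey"


def extend_range_bound(key_pattern, bound, make_upper_inclusive=False):
    """Pad `bound` with MinKey/MaxKey for trailing fields of `key_pattern`.

    Raises ValueError if `bound` is not a prefix of `key_pattern`, which is
    the situation a stale shard ends up in after a shard key refinement.
    """
    pattern_fields = list(key_pattern.items())
    bound_fields = list(bound.items())

    if len(bound_fields) > len(pattern_fields):
        raise ValueError(
            f"key pattern {key_pattern} is shorter than bound {bound}")

    extended = {}
    for (bound_name, value), (pattern_name, _) in zip(bound_fields, pattern_fields):
        if bound_name != pattern_name:
            raise ValueError(
                f"bound field {bound_name!r} does not match pattern field {pattern_name!r}")
        extended[bound_name] = value

    # Pad the fields that the bound does not specify.
    for name, direction in pattern_fields[len(bound_fields):]:
        order = direction if direction in (1, -1) else 1  # e.g. "hashed" -> ascending
        if make_upper_inclusive:
            order = -order
        extended[name] = MIN_KEY if order > 0 else MAX_KEY
    return extended


old_key = {"a": 1}              # what the stale shard still has cached
refined_key = {"a": 1, "b": 1}  # what the range deletion task was written with
task_min = {"a": 5, "b": 10}

# With refreshed metadata the bound is accepted unchanged.
ok = extend_range_bound(refined_key, task_min)

# With stale metadata the bound has more fields than the pattern.
try:
    extend_range_bound(old_key, task_min)
    stale_failed = False
except ValueError:
    stale_failed = True
```

In this model the stale shard rejects the task's bound because it carries the extra field `b`, while a refreshed shard accepts it as is.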



 Comments   
Comment by Githook User [ 27/Oct/21 ]

Author: Jordi Serra Torrens (jordist) <jordi.serra-torrens@mongodb.com>

Message: SERVER-55557 Range deletion of aborted migration can fail after a refine shard key
Branch: v5.0
https://github.com/mongodb/mongo/commit/b5a5e2b2099034a712098939bfc153dfcd33dd2d

Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 28/May/21 ]

Author: Jordi Serra Torrens (jordist) <jordi.serra-torrens@mongodb.com>

Message: SERVER-55557 Range deletion of aborted migration can fail after a refine shard key
Branch: master
https://github.com/mongodb/mongo/commit/2e0261a28fc78379a0394ee5925d409ae8988af9

Comment by Esha Maharishi (Inactive) [ 05/Apr/21 ]

This is a duplicate of SERVER-52906, but we may want to close SERVER-52906 and keep this ticket open instead, since this ticket discusses potential solutions.

Comment by Jordi Serra Torrens [ 26/Mar/21 ]

A couple of alternatives on how to address this:

a) We could catch this error and refresh the metadata, so the next time this range deletion task is retried it will know of the new shard key. 

b) We could make the shard refresh triggered by _configsvrRefineCollectionShardKey required for correctness (instead of best-effort fire-and-forget), ensuring that the shards successfully refresh and flush the refresh with majority write concern before _configsvrRefineCollectionShardKey returns.
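
As a rough illustration of alternative (a), the following Python sketch models a shard that catches the mismatch, refreshes its cached metadata, and retries the task. Every name here (`Shard`, `KeyPatternMismatch`, `run_range_deletion`, `max_attempts`) is hypothetical and chosen for the example; it does not describe the committed fix.

```python
# Toy model of "catch the error, refresh, retry" for a range deletion task.
# All names are illustrative and do not correspond to server classes.
class KeyPatternMismatch(Exception):
    """The task's bounds do not fit the shard's cached shard key."""


class Shard:
    def __init__(self, cached_key, config_key):
        self.cached_key = tuple(cached_key)   # possibly stale field names
        self._config_key = tuple(config_key)  # stands in for the config server
        self.refreshes = 0

    def refresh_metadata(self):
        # Pull the authoritative shard key into the local cache.
        self.cached_key = self._config_key
        self.refreshes += 1

    def delete_range(self, task_min, task_max):
        for bound in (task_min, task_max):
            fields = tuple(bound)  # field names of the bound, in order
            # The bound's fields must be a prefix of the cached shard key.
            if fields != self.cached_key[:len(fields)]:
                raise KeyPatternMismatch(
                    f"bound {bound} does not fit shard key {self.cached_key}")
        return "deleted"


def run_range_deletion(shard, task_min, task_max, max_attempts=2):
    for attempt in range(max_attempts):
        try:
            return shard.delete_range(task_min, task_max)
        except KeyPatternMismatch:
            if attempt == max_attempts - 1:
                raise  # still failing after a refresh: surface the error
            shard.refresh_metadata()


# The shard still caches {a}, but the collection was refined to {a, b}.
shard = Shard(cached_key=("a",), config_key=("a", "b"))
result = run_range_deletion(shard, {"a": 0, "b": 0}, {"a": 10, "b": 0})
```

Here the first attempt fails against the stale key, the shard refreshes once, and the retry succeeds. Bounding the number of attempts keeps a genuinely malformed task from retrying forever.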
