[SERVER-58592] Make ReshardingCoordinatorService more robust when stepdowns happen near the end of a resharding operation. Created: 15/Jul/21  Updated: 29/Oct/23  Resolved: 03/Aug/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: New Feature Priority: Major - P3
Reporter: Kshitij Gupta Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.0
Sprint: Sharding 2021-07-26, Sharding 2021-08-09
Participants:
Linked BF Score: 120
Story Points: 2

 Description   

In our current implementation of the resharding coordinator, when resharding is done, we first remove the on-disk coordinator document and then clean up the in-memory state (i.e., completing/stepping down the metrics). This can cause issues. Consider the case in the BF: a stepdown occurs after the coordinator document has been deleted but before the in-memory state has been cleaned up. Since the coordinator document has been deleted, the PrimaryOnlyServiceOpObserver removes this instance from the _activeInstances map in PrimaryOnlyService. After this config server primary (referred to as primary_1 from here on) steps down, a new primary steps up. Since the old document and instance were deleted, the new primary won't resume the same resharding operation and will instead wait for the next one. When primary_1 steps up as primary again, it will still hold the stale in-memory state from the original resharding operation, which will conflict with the in-memory state of any new resharding operation.
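
The ordering hazard described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the server code and not necessarily the fix that was committed: `Node`, `StepDown`, `finish_delete_first`, `finish_clear_first`, and `run` are hypothetical names, the coordinator documents are modeled as a plain set, and the "clear first" ordering is just one possible way to avoid the stale state.

```python
class StepDown(Exception):
    """Simulates the node losing primary status in the middle of cleanup."""


class Node:
    """Toy stand-in for a config server node (hypothetical, not MongoDB code)."""

    def __init__(self):
        self.metrics_active = False  # in-memory resharding metrics state

    def start_resharding(self, docs, op_id):
        if self.metrics_active:
            raise RuntimeError("stale metrics from a previous operation")
        self.metrics_active = True
        docs.add(op_id)  # persist the coordinator document

    def finish_delete_first(self, docs, op_id, stepdown=False):
        # Ordering described in the ticket: delete the document, then clean memory.
        docs.discard(op_id)
        if stepdown:
            raise StepDown()  # in-memory cleanup never runs
        self.metrics_active = False

    def finish_clear_first(self, docs, op_id, stepdown=False):
        # Alternative ordering (an assumption): clean memory before deleting.
        self.metrics_active = False
        if stepdown:
            raise StepDown()  # the document survives the interruption
        docs.discard(op_id)


def run(finish_name):
    """Run one operation on primary_1 with a stepdown in the middle of cleanup.

    Returns (metrics still active on primary_1, coordinator document still present).
    """
    docs = set()
    primary_1 = Node()
    primary_1.start_resharding(docs, "op1")
    try:
        getattr(primary_1, finish_name)(docs, "op1", stepdown=True)
    except StepDown:
        pass
    return primary_1.metrics_active, "op1" in docs


if __name__ == "__main__":
    print(run("finish_delete_first"))  # (True, False): stale metrics, no document
    print(run("finish_clear_first"))   # (False, True): clean memory, document remains
```

In the delete-first case, primary_1 is left with active metrics and no document, so a later `start_resharding` on that node fails in this model. In the clear-first case the document still exists after the stepdown, so in this model the operation remains discoverable by whichever node becomes primary next.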



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 22/Sep/21 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-58592 Make sure to clear resharding metrics after reshard collection completes.

(cherry picked from commit a33a04b6186ea5b56c1c9228ed19c41061f80749)
Branch: v5.0
https://github.com/mongodb/mongo/commit/d21d8ddf8301261c26410fd4e7dda8bf75d780fb

Comment by Githook User [ 02/Aug/21 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-58592 Make sure to clear resharding metrics after reshard collection completes.
Branch: master
https://github.com/mongodb/mongo/commit/a33a04b6186ea5b56c1c9228ed19c41061f80749
