[SERVER-50937] Make resharding coordinator support recovery Created: 14/Sep/20 Updated: 29/Oct/23 Resolved: 26/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.3, 5.1.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Haley Connelly | Assignee: | Randolph Tan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-234-M3, PM-234-T-lifecycle |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v5.0 |
| Sprint: | Sharding 2021-07-12, Sharding 2021-07-26 |
| Participants: | |
| Linked BF Score: | 19 |
| Story Points: | 3 |
| Description |
|
Needs further investigation. Contact blake.oler before starting work on this.

Notes
Out of date old description

Introduce methods that can manually fulfill the promises in the ReshardingCoordinatorObserver for the recovery process.

Create a flag for the ReshardingCoordinatorObserver, say _shouldObserveWrites, to prevent writes from fulfilling the observer's promises while in recovery. When in recovery (when the ReshardingCoordinatorService is constructed in a state > kInitializing), construct the ReshardingCoordinatorObserver with _shouldObserveWrites set to false until the ReshardingCoordinator has fully recovered and it is safe for the ReshardingCoordinatorObserver to begin observing writes again. We will flip this flag to true as part of the recovery process (described below).

Note that this means writes to config.reshardingOperations can happen before the ReshardingCoordinatorService is constructed, or after it is constructed but before we've done recovery. This is okay, because the coordinator will read from disk as part of recovery (described below).

At the start of ReshardingCoordinatorService::run(), if the coordinator is recovering (its state is > kInitializing), do the following in order to fulfill any promises that would have been fulfilled already had we not failed over, and to observe any writes that came in before we started recovery: |
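The gating idea in the old description can be illustrated with a minimal, single-threaded C++ sketch. It is not the server's implementation: `ObserverSketch`, `onWrite`, and `recoverFromDisk` are hypothetical names, a plain `bool` stands in for a promise, and the real observer would also need synchronization. Only the `_shouldObserveWrites` flag and the "read from disk, then resume observing" order come from the description.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for ReshardingCoordinatorObserver.
// A bool models "the promise has been fulfilled"; no threading is modeled.
class ObserverSketch {
public:
    // Pass false when the coordinator is constructed in a recovering state.
    explicit ObserverSketch(bool shouldObserveWrites)
        : _shouldObserveWrites(shouldObserveWrites) {}

    // Invoked for a write to the coordinator's state document. The write is
    // ignored while the flag is false, i.e. while recovery is in progress.
    void onWrite(bool allParticipantsReady) {
        if (!_shouldObserveWrites) {
            return;
        }
        _fulfillIfReady(allParticipantsReady);
    }

    // Hypothetical manual-fulfillment entry point used by recovery: apply the
    // state read from disk, then flip the flag so later writes are observed.
    void recoverFromDisk(bool allParticipantsReadyOnDisk) {
        // Assumption of this sketch: recovery only runs on an observer that
        // was constructed with write observation disabled.
        assert(!_shouldObserveWrites);
        _fulfillIfReady(allParticipantsReadyOnDisk);
        _shouldObserveWrites = true;
    }

    bool fulfilled() const { return _fulfilled; }
    bool observingWrites() const { return _shouldObserveWrites; }

private:
    // Idempotent: fulfilling an already-fulfilled condition is a no-op.
    void _fulfillIfReady(bool ready) {
        if (ready) {
            _fulfilled = true;
        }
    }

    bool _shouldObserveWrites;
    bool _fulfilled = false;
};
```

In this model, a write that arrives before recovery has run leaves the observer untouched, and the subsequent `recoverFromDisk` call picks up the same information from the persisted state, which is why dropping the early write is harmless.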
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Githook User [ 11/Aug/21 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit be9790dcf2de451d8e218f4471d2d8faa5f26aaa) |
| Comment by Githook User [ 11/Aug/21 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: (cherry picked from commit cbddf73dc78aa6a208fe3a43ca5e8674f67d5b87) |
| Comment by Githook User [ 22/Jul/21 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: |
| Comment by Githook User [ 07/Jul/21 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'} Message: |
| Comment by Max Hirschhorn [ 21/Nov/20 ] |
|
I wanted to highlight a few details which will hopefully be helpful when thinking through how to implement recovery safely.
One idea, which came up with Spencer Brody and Esha when the resharding project was first writing its primary-only services, is to avoid having multiple threads write to the documents in the config.reshardingOperations collection that backs the primary-only service.
Recovery for the ReshardingCoordinator would trigger the ReshardingCoordinatorObserver using the contents of the associated document in the config.reshardingOperations.participants collection. The ReshardingCoordinator would be responsible for doing this in its run() rather than being an automatic part of primary-only service rebuilding the Instance.
|
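As a rough illustration of the last point, the sketch below has the coordinator's own `run()` replay the persisted document into the observer when it starts in a state past `kInitializing`. Everything here is a simplified assumption: `CoordinatorDoc`, `ParticipantObserverSketch`, and `CoordinatorSketch` are hypothetical types, the state names other than `kInitializing` are placeholders, and no threading or primary-only service machinery is modeled.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Ordering matters: any state greater than kInitializing is treated as
// recovery. kCloning and kDone are illustrative placeholders.
enum class CoordinatorState { kInitializing = 0, kCloning = 1, kDone = 2 };

// Hypothetical, pared-down view of the persisted coordinator document.
struct CoordinatorDoc {
    CoordinatorState state;
    std::vector<bool> participantsFinishedCloning;
};

// Hypothetical observer: its condition is met once every participant listed
// in the document reports that it has finished.
class ParticipantObserverSketch {
public:
    void onDocument(const CoordinatorDoc& doc) {
        bool all = !doc.participantsFinishedCloning.empty();
        for (bool done : doc.participantsFinishedCloning) {
            all = all && done;
        }
        if (all) {
            _allFinishedCloning = true;
        }
    }

    bool allFinishedCloning() const { return _allFinishedCloning; }

private:
    bool _allFinishedCloning = false;
};

class CoordinatorSketch {
public:
    explicit CoordinatorSketch(CoordinatorDoc persisted) : _doc(std::move(persisted)) {}

    // Recovery happens here, not while the instance is being rebuilt.
    // Returns true if the recovery path was taken.
    bool run() {
        // Assumption of this sketch: run() is invoked once per instance.
        assert(!_ran);
        _ran = true;

        const bool recovering = _doc.state > CoordinatorState::kInitializing;
        if (recovering) {
            // Trigger the observer from the contents of the on-disk document.
            _observer.onDocument(_doc);
        }
        return recovering;
    }

    const ParticipantObserverSketch& observer() const { return _observer; }

private:
    CoordinatorDoc _doc;
    ParticipantObserverSketch _observer;
    bool _ran = false;
};
```

Keeping the replay inside `run()` means constructing the instance has no side effects on the observer, so the only code path that feeds it persisted state is the coordinator itself.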