- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Fully Compatible
- v5.0
- Sharding 2021-07-12, Sharding 2021-07-26
- 19
- 3
Needs further investigation. Contact blake.oler before starting work on this.
Notes
- It's okay for onCommit handlers to be run out of order if they have attached opTimes.
- How do we ensure that the resharding coordinator is always using the latest version of the document to replace the contents on disk?
- Is ensuring that promises are fulfilled on recovery as simple as updating the in-memory document after checking the future of the first promise? Would it be simpler to do away completely with an in-memory representation of the underlying document?
- Need to make sure that we don't write to the temporary resharding collection entry when it should have already been removed.
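To illustrate the first note, here is a minimal sketch of why out-of-order onCommit handlers can be tolerated when each carries an opTime. It is not the coordinator's actual code: `OpTimeSketch`, `LatestDocHolder`, and the (term, timestamp) ordering are hypothetical simplifications, and the document is reduced to a state string.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical, simplified opTime: ordered by (term, timestamp).
struct OpTimeSketch {
    std::uint64_t term = 0;
    std::uint64_t timestamp = 0;

    bool operator<(const OpTimeSketch& other) const {
        return std::tie(term, timestamp) < std::tie(other.term, other.timestamp);
    }
};

// Hypothetical holder for the in-memory copy of the coordinator document.
class LatestDocHolder {
public:
    // Called from an onCommit handler. Returns true if the update was applied,
    // false if it was stale (its opTime is not newer than the last one applied).
    bool onCommit(const OpTimeSketch& opTime, std::string state) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (!(_lastApplied < opTime)) {
            return false;  // an update with a newer or equal opTime already ran
        }
        _lastApplied = opTime;
        _state = std::move(state);
        return true;
    }

    std::string state() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _state;
    }

private:
    mutable std::mutex _mutex;
    OpTimeSketch _lastApplied;  // starts at (0, 0), older than any real update
    std::string _state = "initializing";
};
```

With this shape, a handler that arrives late with an older opTime is dropped, so the in-memory state always reflects the newest committed write seen so far.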
Old description (out of date)
Introduce methods that can manually fulfill the promises in the ReshardingCoordinatorObserver for the recovery process.
Create a flag for the ReshardingCoordinatorObserver, say _shouldObserveWrites, to prevent writes from fulfilling the observer's promises while in recovery.
When in recovery (that is, when the ReshardingCoordinatorService is constructed in a state > kInitializing), construct the ReshardingCoordinatorObserver with _shouldObserveWrites set to false, and keep it false until the ReshardingCoordinator has fully recovered and it is safe for the ReshardingCoordinatorObserver to begin observing writes again. We will flip this flag to true as part of the recovery process (described below). Note that this means writes to config.reshardingOperations can happen before the ReshardingCoordinatorService is constructed, or after it's constructed but before we've done recovery. This is okay, because the coordinator will read from disk as part of recovery (described below).
At the start of ReshardingCoordinatorService::run(), if the coordinator is recovering (its state is > kInitializing), do the following, in order. This fulfills any promises that would already have been fulfilled had we not failed over, and picks up any writes that came in before we started recovery:
1. Take the collection lock in mode S
2. Read config.reshardingOperations for this resharding op
3. Inspect the doc and fulfill any promises that should be fulfilled already
4. Flip _shouldObserveWrites to true
5. Release the collection lock
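The five steps above can be sketched roughly as follows. This is a self-contained illustration, not the server's code: `CoordinatorState`, `CoordinatorDoc`, `ObserverSketch`, and `recoverObserver` are hypothetical names, the promises are modeled as plain one-shot boolean flags, a `std::shared_mutex` held in shared mode stands in for the collection lock in mode S, and the "read" in step 2 is just a copy of an in-memory struct.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Hypothetical, simplified state enum; only the ordering matters here.
enum class CoordinatorState { kInitializing, kCloning, kApplying };

// Hypothetical stand-in for this operation's config.reshardingOperations document.
struct CoordinatorDoc {
    CoordinatorState state = CoordinatorState::kInitializing;
};

class ObserverSketch {
public:
    explicit ObserverSketch(bool shouldObserveWrites)
        : _shouldObserveWrites(shouldObserveWrites) {}

    // Called when a write to the document commits. Ignored while in recovery.
    void onWrite(const CoordinatorDoc& doc) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_shouldObserveWrites) {
            _fulfillUpTo(doc.state);
        }
    }

    // Steps 3 and 4: fulfill what the on-disk document implies, then start observing.
    void recoverFrom(const CoordinatorDoc& doc) {
        std::lock_guard<std::mutex> lk(_mutex);
        assert(!_shouldObserveWrites);  // this sketch expects recovery to run once
        _fulfillUpTo(doc.state);
        _shouldObserveWrites = true;
    }

    bool cloningFulfilled() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _cloningFulfilled;
    }

    bool applyingFulfilled() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _applyingFulfilled;
    }

private:
    // Each flag models one promise; once set it stays set.
    void _fulfillUpTo(CoordinatorState state) {
        _cloningFulfilled = _cloningFulfilled || state >= CoordinatorState::kCloning;
        _applyingFulfilled = _applyingFulfilled || state >= CoordinatorState::kApplying;
    }

    mutable std::mutex _mutex;
    bool _shouldObserveWrites;
    bool _cloningFulfilled = false;
    bool _applyingFulfilled = false;
};

void recoverObserver(ObserverSketch& observer,
                     const CoordinatorDoc& onDisk,
                     std::shared_mutex& collectionLock) {
    std::shared_lock<std::shared_mutex> lk(collectionLock);  // 1. lock in mode S
    CoordinatorDoc doc = onDisk;                             // 2. read the document
    observer.recoverFrom(doc);                               // 3 and 4
}  // 5. the lock is released when lk goes out of scope
```

Holding the shared lock across steps 2 to 4 is what keeps a concurrent writer (which would need an exclusive lock in this model) from committing between the read and the flag flip, so no write is both missed by the read and ignored by the observer.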
- depends on
  - SERVER-57624 Coordinator should check participant states first before waiting (Closed)
  - SERVER-50960 Modify PrimaryOnlyService's lookup() to take in a parameter indicating whether to wait on the rebuilding condition variable (Closed)
  - SERVER-55682 Improve ability to test ReshardingCoordinator from C++ (Closed)
- is depended on by
  - SERVER-51495 Reenable reshard_collection_basic.js in sharding_continuous_config_stepdown suite (Backlog)
- is related to
  - SERVER-49572 Implement onReshardingParticipantTransition in the ReshardingCoordinatorObserver (Closed)
  - SERVER-50982 PrimaryOnlyService::lookupInstance should take an OperationContext and use interruptible waits (Closed)
- related to
  - SERVER-61483 Resharding coordinator fails to recover abort decision on step-up, attempts to commit operation as success, leading to data inconsistency (Closed)