- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- ALL
- 200
If a resharding operation is aborted by abortReshardCollection or by a setFCV downgrade command while the coordinator is transitioning from kInitializing to kPreparingToDonate, recipient shards can be left with orphaned state machines in config.localReshardingOperations.recipient. The recipients remain in awaiting-fetch-timestamp indefinitely because they are never notified of the abort.
When the resharding coordinator transitions to kPreparingToDonate, the disk write commits first, which makes participants aware of the resharding operation. The in-memory _coordinatorDoc update runs afterward on the same interruptible OperationContext. If an abort cancels that opCtx in the window between the disk write and the in-memory update, the in-memory _coordinatorDoc still reads kInitializing. The abort handler dispatches on _coordinatorDoc.getState() and, seeing state < kPreparingToDonate, takes the coordinator-only abort path. That path skips notifying participants, so recipients are left stuck with orphaned state machines.
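The window described above can be illustrated with a toy model. This is only a sketch of the ordering problem, not MongoDB's actual code: `ToyCoordinator`, `Interrupted`, `run`, and every field name are hypothetical, and the durable write and opCtx cancellation are simulated with plain Python attributes and an exception.

```python
from enum import IntEnum


class CoordinatorState(IntEnum):
    # Hypothetical stand-ins for kInitializing and kPreparingToDonate.
    INITIALIZING = 1
    PREPARING_TO_DONATE = 2


class Interrupted(Exception):
    """Stands in for cancellation of the interruptible OperationContext."""


class ToyCoordinator:
    def __init__(self):
        # What participants can observe once the disk write has committed.
        self.durable_state = CoordinatorState.INITIALIZING
        # Models the in-memory _coordinatorDoc state.
        self.in_memory_state = CoordinatorState.INITIALIZING
        self.participants_notified = False

    def transition_to_preparing_to_donate(self, interrupt_after_write=False):
        # Step 1: the disk write commits first.
        self.durable_state = CoordinatorState.PREPARING_TO_DONATE
        # The abort window: the same opCtx is cancelled before step 2 runs.
        if interrupt_after_write:
            raise Interrupted("cancelled between disk write and in-memory update")
        # Step 2: the in-memory update runs afterward.
        self.in_memory_state = CoordinatorState.PREPARING_TO_DONATE

    def abort(self):
        # The handler dispatches on the in-memory state only.
        if self.in_memory_state < CoordinatorState.PREPARING_TO_DONATE:
            return "coordinator-only"
        self.participants_notified = True
        return "notify-participants"


def run(interrupt):
    coordinator = ToyCoordinator()
    try:
        coordinator.transition_to_preparing_to_donate(interrupt_after_write=interrupt)
    except Interrupted:
        pass
    path = coordinator.abort()
    return path, coordinator.durable_state, coordinator.participants_notified
```

Calling `run(True)` leaves the durable state at `PREPARING_TO_DONATE` while the abort takes the coordinator-only path and never notifies participants, which corresponds to the stuck-recipient outcome in this ticket. Calling `run(False)` takes the notifying path.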
This is a different manifestation of the same fundamental problem as SERVER-92857.
- is related to
  - SERVER-92857 Resharding Coordinator's abort hangs if it encounters an unrecoverable error while establishing participants
- Backlog