Resharding can hang on abort as coordinator takes wrong abort path due to stale state

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL
    • 200

      If a resharding operation is aborted, either by abortReshardCollection or by a setFCV downgrade, while the coordinator is transitioning from kInitializing to kPreparingToDonate, recipient shards can be left stuck with orphaned state machines in config.localReshardingOperations.recipient. The recipients remain in awaiting-fetch-timestamp indefinitely because they are never notified of the abort.

      When the resharding coordinator transitions to kPreparingToDonate, the disk write commits first, which makes participants aware of the resharding operation. The in-memory _coordinatorDoc update runs afterward, using the same interruptible OperationContext. If an abort cancels that opCtx in the window between the disk write and the in-memory update, the in-memory _coordinatorDoc still reads kInitializing. The abort handler dispatches on _coordinatorDoc.getState(); seeing state < kPreparingToDonate, it takes the coordinator-only abort path, which skips notifying participants and leaves recipients stuck with orphaned state machines.
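
      The window described above can be illustrated with a small toy model. This is only a sketch of the ordering problem, not the server's code: ToyCoordinator, its fields, the abort_in_window flag, and the returned path names are all hypothetical stand-ins for the durable coordinator document, the in-memory _coordinatorDoc, and the interruptible opCtx.

```python
from enum import IntEnum


class State(IntEnum):
    INITIALIZING = 1          # stands in for kInitializing
    PREPARING_TO_DONATE = 2   # stands in for kPreparingToDonate


class Interrupted(Exception):
    """Hypothetical stand-in for an interrupted OperationContext."""


class ToyCoordinator:
    """Toy model only; none of these names exist in the server."""

    def __init__(self):
        self.disk_state = State.INITIALIZING    # durable coordinator document
        self.memory_state = State.INITIALIZING  # in-memory _coordinatorDoc
        self.op_killed = False                  # True once an abort cancels the opCtx
        self.recipients_notified_of_abort = False

    def transition_to_preparing_to_donate(self, abort_in_window=False):
        # Step 1: the disk write commits, so participants can see the operation.
        self.disk_state = State.PREPARING_TO_DONATE
        # An abort may arrive exactly here and cancel the shared opCtx.
        if abort_in_window:
            self.op_killed = True
        # Step 2: the in-memory update is interruptible and never runs if killed.
        if self.op_killed:
            raise Interrupted()
        self.memory_state = State.PREPARING_TO_DONATE

    def abort(self):
        # The handler dispatches on the in-memory state, which may be stale.
        if self.memory_state < State.PREPARING_TO_DONATE:
            return "coordinator-only"   # participants are never told
        self.recipients_notified_of_abort = True
        return "notify-participants"


def run(abort_in_window):
    coordinator = ToyCoordinator()
    try:
        coordinator.transition_to_preparing_to_donate(abort_in_window)
    except Interrupted:
        pass  # the transition was cut short between the two steps
    return coordinator, coordinator.abort()


if __name__ == "__main__":
    for in_window in (False, True):
        c, path = run(in_window)
        print(in_window, c.disk_state.name, c.memory_state.name, path)
```

      In this model, an abort that lands inside the window leaves the durable state at PREPARING_TO_DONATE while the in-memory state is still INITIALIZING, so the handler picks the coordinator-only path and the recipients are never notified.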

      This is a different manifestation of the same fundamental problem as SERVER-92857.

            Assignee:
            Unassigned
            Reporter:
            Abdul Qadeer