Resharding coordinator gets stuck in infinite retry loop when abort is requested during commit phase


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.3.0-rc0
    • Affects Version/s: 8.3.0-rc0
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • ClusterScalability Jan5-Jan19
    • 200

      The resharding coordinator can get stuck in an infinite retry loop if an abort is requested right after it transitions to the committing state. When the coordinator reaches the committing state, it calls tellAllRecipientsToRefresh to send _flushReshardingStateChange commands to the recipient shards, passing the abort token to the async RPC layer. If an abort command (e.g. abortUnshardCollection) arrives while these commands are in flight, the abort token is cancelled. The async RPC layer detects the cancellation and cancels the pending _flushReshardingStateChange command, returning a CallbackCanceled error. Because CallbackCanceled is classified as a retriable error and the retry loop is governed by the stepdown token (which is still valid), the coordinator retries the operation. Each retry immediately fails with CallbackCanceled because the abort token remains cancelled, so the loop never terminates.

      Note that resharding cannot abort once it is past the committing state; in this case, however, it also does not continue to completion. This bug was introduced in SERVER-114005.
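
      The interaction between the two tokens can be modelled in a few lines. The following is a minimal Python sketch of the failure mode, not the server code: CancellationToken, send_flush_command, and the max_attempts cap are hypothetical names introduced only for illustration (the cap exists so the demo terminates). The second call shows a hypothetical variant in which the RPC observes only the stepdown token; it is included for contrast and is not necessarily how the bug was fixed.

```python
class CancellationToken:
    """Hypothetical stand-in for a cancellation token; not the server's real type."""

    def __init__(self):
        self._canceled = False

    def cancel(self):
        self._canceled = True

    def is_canceled(self):
        return self._canceled


class CallbackCanceled(Exception):
    """Models the retriable error returned when the RPC layer cancels a command."""


def send_flush_command(rpc_token):
    # The RPC layer gives up as soon as the token it was handed is cancelled.
    if rpc_token.is_canceled():
        raise CallbackCanceled("command cancelled by token")
    return "ok"


def tell_all_recipients_to_refresh(rpc_token, stepdown_token, max_attempts=5):
    """Retry while the stepdown token is valid.

    max_attempts exists only so this demo terminates; the loop described
    in the report has no such cap.
    """
    attempts = 0
    while not stepdown_token.is_canceled() and attempts < max_attempts:
        attempts += 1
        try:
            return send_flush_command(rpc_token), attempts
        except CallbackCanceled:
            continue  # classified as retriable, so try again
    return "gave up", attempts


abort_token = CancellationToken()
stepdown_token = CancellationToken()
abort_token.cancel()  # abort requested right after entering the committing state

# RPC watches the abort token, retry loop watches the stepdown token: every attempt fails.
stuck = tell_all_recipients_to_refresh(abort_token, stepdown_token)

# Hypothetical variant: the RPC observes only the stepdown token.
unblocked = tell_all_recipients_to_refresh(stepdown_token, stepdown_token)

print(stuck, unblocked)
```

      In the first call every attempt is rejected because the abort token stays cancelled while the loop condition only checks the stepdown token, so without the cap the loop would spin forever. In the second call the command is sent on the first attempt.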

            Assignee:
            Abdul Qadeer
            Reporter:
            Abdul Qadeer
