CursorEstablisher can return CallbackCanceled over a retargeting error when a sibling retry backoff is cancelled


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • ClusterScalability 8Jun-22June

      CursorEstablisher::_prioritizeFailures decides which failure to surface when multiple remotes fail with distinct errors. It can return CallbackCanceled ("Baton wait canceled"), which is produced when a retry backoff sleep is cancelled, in preference to a recoverable retargeting error such as StaleConfig.

      This is the same class of bug as SERVER-123598, which stopped CallbackCanceled from clobbering a stored error, but that fix deliberately excluded retargeting errors (&& !_currentErrorIsRetargetingError()), leaving them exposed.

      Example sequence:

      • CursorEstablisher sends two requests
      • First request is rejected by the shard's rate limiter (SystemOverloadedError), inducing a backoff sleep
      • Second request fails with StaleConfig, which is stored and triggers stopRetrying()
      • stopRetrying() cancels the first request's backoff, yielding CallbackCanceled
      • Final error returned is CallbackCanceled instead of StaleConfig
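
      The sequence above can be replayed against a toy model. This is only a sketch: ToyEstablisher, onRemoteFailure, isRetargetingError and replayExampleSequence are hypothetical names, not the server's code, and the rule that any other incoming error overwrites the stored one is an assumption made to keep the model small. Only the guard shape (CallbackCanceled is ignored unless the stored error is a retargeting error) is taken from this ticket.

```cpp
#include <optional>

// Toy model only: the enumerators mirror error names from the ticket,
// everything else is hypothetical and simplified.
enum class ErrorCode { SystemOverloadedError, StaleConfig, CallbackCanceled };

inline bool isRetargetingError(ErrorCode code) {
    return code == ErrorCode::StaleConfig;
}

class ToyEstablisher {
public:
    // Records a remote failure using the pre-fix prioritization described in
    // the ticket: CallbackCanceled does not clobber a stored error, EXCEPT
    // when the stored error is a retargeting error.
    void onRemoteFailure(ErrorCode incoming) {
        if (!_current) {
            _current = incoming;
            return;
        }
        const bool keepStored = incoming == ErrorCode::CallbackCanceled &&
            !isRetargetingError(*_current);
        if (!keepStored) {
            // Assumption: every other case simply overwrites the stored error.
            _current = incoming;
        }
    }

    std::optional<ErrorCode> finalError() const {
        return _current;
    }

private:
    std::optional<ErrorCode> _current;
};

// Replays the example sequence and returns the error that would be surfaced.
inline ErrorCode replayExampleSequence() {
    ToyEstablisher establisher;
    // Request 1 was rate limited and is sleeping in backoff, so it has not
    // reported a failure yet.
    // Request 2 fails with StaleConfig, which is stored and stops retries.
    establisher.onRemoteFailure(ErrorCode::StaleConfig);
    // Stopping retries cancels request 1's backoff sleep.
    establisher.onRemoteFailure(ErrorCode::CallbackCanceled);
    return *establisher.finalError();
}
```

      In this model replayExampleSequence() returns CallbackCanceled, because the stored StaleConfig is a retargeting error and therefore falls outside the guard.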

      Because CallbackCanceled is in the CancellationError category (not RetriableError), the recoverable StaleConfig refresh-and-retry path is lost and the operation fails with a non-retriable error. This was observed in the ingress-rate-limiter passthrough suites (where overload retries with backoff are common), surfacing to clients as e.g. "Executor error during getMore :: caused by :: Baton wait canceled" on $lookup getMores.

      A possible fix is to always prefer an existing error over a CallbackCanceled error, i.e., remove the !_currentErrorIsRetargetingError() exception.
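
      A minimal sketch of that rule, written as a standalone function rather than as a patch to the real method. chooseError is a hypothetical helper, and letting any non-cancellation error replace the stored one is an assumption; the actual prioritization between other error types is not described in this ticket.

```cpp
#include <optional>

// Toy model only: the enumerators mirror error names from the ticket.
enum class ErrorCode { SystemOverloadedError, StaleConfig, CallbackCanceled };

// Hypothetical helper: returns the error to keep after a new failure arrives.
// A CallbackCanceled never replaces an error that is already stored, with no
// exception for retargeting errors.
inline ErrorCode chooseError(std::optional<ErrorCode> stored, ErrorCode incoming) {
    if (stored && incoming == ErrorCode::CallbackCanceled) {
        return *stored;
    }
    // Assumption: in every other case the incoming error is kept.
    return incoming;
}
```

      With this rule, a stored StaleConfig survives the CallbackCanceled produced by the cancelled backoff, so the refresh-and-retry path stays available. A CallbackCanceled is still surfaced when no other error has been stored.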

            Assignee: Jack Mulrow
            Reporter: Jack Mulrow