AsyncResultsMerger retryable error handling doesn't account for detached cursor state, causing deadlock or lost retries

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Execution
    • ALL
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Problem

      The ARM retry mechanism (introduced by SPM-4003) schedules retries via _subBaton->waitUntil(delay, token). This requires a live SubBaton, but the ARM is frequently detached between getMore calls (cursor stashed in ClusterCursorManager), during which _opCtx is null and the SubBaton is dead. Shard responses arrive asynchronously on the reactor thread regardless of whether the ARM is attached or detached.

      When a retryable error (e.g. IngressRequestRateLimitExceeded with RetryableError label) arrives while the ARM is detached:

      1. _subBaton->waitUntil() returns an already-resolved future (SubBaton dead)
      2. getAsync() invokes the retry callback inline on the reactor thread
      3. The callback tries to acquire _mutex, which is already held by _handleBatchResponse on the same thread
      4. Self-deadlock: the reactor thread is permanently blocked

      Design Gap?

      The technical design for overload retryability added retry support to the ARM, but, arguably correctly, did not account for the post-project change via SERVER-103733 to integrate the baton into the getMore pathway. The design assumes the ARM always has a live opCtx when a shard response arrives. After SERVER-103733, the detach window between getMore calls now contains a retryability gap. The detach window is when most shard responses arrive — this is the common case, not an edge case.

      Reproduction

      The deadlock is triggered reliably by the failIngressRequestRateLimiting failpoint in sharded passthrough suites. Any getMore that receives a retryable error while the cursor is stashed between client requests will deadlock the reactor thread.

      If the deadlock is avoided by simply skipping the retry when detached (gating the retry branch on _opCtx != null), the retryable error is instead surfaced to the client with zero retry attempts, violating the design's guarantee.

      Proposed Fix

      When the ARM is detached and the retry strategy says to retry, don't fail the batch and don't schedule on the dead SubBaton. Signal the current event and return, leaving the remote in a retriable state:

      • outstandingRequest is false (already cleared at top of _handleBatchResponse)
      • No error set on the remote or the ARM's _status
      • Cursor ID still valid (remote not exhausted)
      • On the next cursor checkout, reattachToOperationContext creates a fresh SubBaton, and _scheduleGetMores sees a remote with no outstanding request, no buffered results, and a valid cursor — it naturally re-sends the getMore. The retry strategy's attempt counter was already incremented by recordFailureAndEvaluateShouldRetry, so the budget is correctly tracked.
      } else if (retryStrategy.recordFailureAndEvaluateShouldRetry(
      parsedResponse.getStatus(),
      remote->shardHostAndPort,
      cbData.response.getErrorLabels())) {
      if (!_opCtx) {
      // ARM is detached — can't schedule retry on dead SubBaton.
      // Leave remote in retriable state; _scheduleGetMores will
      // re-send the getMore on reattach.
      _signalCurrentEventIfReady(lk);
      return;
      }
      // ... existing retry scheduling via _subBaton->waitUntil() ...
      

      Trade-offs

      • No backoff delay on deferred retries: The design specifies exponential backoff for SystemOverloadedError retries. When deferred to reattach, the getMore is re-sent immediately. This is acceptable because the client round-trip time (receive batch, process, issue next getMore) provides a natural delay comparable to the backoff parameters (100-400ms).
      • Retry attempt count is still tracked: recordFailureAndEvaluateShouldRetry increments the counter before the _opCtx check, so the max retry limit and adaptive budget are correctly enforced.

            Assignee:
            Unassigned
            Reporter:
            Blake Oler
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: