Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Query Execution
Operating System:
ALL
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Problem

The ARM retry mechanism (introduced by SPM-4003) schedules retries via _subBaton->waitUntil(delay, token). This requires a live SubBaton, but the ARM is frequently detached between getMore calls (cursor stashed in ClusterCursorManager), during which _opCtx is null and the SubBaton is dead. Shard responses arrive asynchronously on the reactor thread regardless of whether the ARM is attached or detached.

When a retryable error (e.g. IngressRequestRateLimitExceeded with RetryableError label) arrives while the ARM is detached:

_subBaton->waitUntil() returns an already-resolved future (SubBaton dead)
getAsync() invokes the retry callback inline on the reactor thread
The callback tries to acquire _mutex, which is already held by _handleBatchResponse on the same thread
Self-deadlock: the reactor thread is permanently blocked

Design Gap?

The technical design for overload retryability added retry support to the ARM, but, arguably correctly, did not account for the post-project change via SERVER-103733 to integrate the baton into the getMore pathway. The design assumes the ARM always has a live opCtx when a shard response arrives. After SERVER-103733, the detach window between getMore calls now contains a retryability gap. The detach window is when most shard responses arrive — this is the common case, not an edge case.

Reproduction

The deadlock is triggered reliably by the failIngressRequestRateLimiting failpoint in sharded passthrough suites. Any getMore that receives a retryable error while the cursor is stashed between client requests will deadlock the reactor thread.

If the deadlock is avoided by simply skipping the retry when detached (gating the retry branch on _opCtx != null), the retryable error is instead surfaced to the client with zero retry attempts, violating the design's guarantee.

Proposed Fix

When the ARM is detached and the retry strategy says to retry, don't fail the batch and don't schedule on the dead SubBaton. Signal the current event and return, leaving the remote in a retriable state:

outstandingRequest is false (already cleared at top of _handleBatchResponse)
No error set on the remote or the ARM's _status
Cursor ID still valid (remote not exhausted)
On the next cursor checkout, reattachToOperationContext creates a fresh SubBaton, and _scheduleGetMores sees a remote with no outstanding request, no buffered results, and a valid cursor — it naturally re-sends the getMore. The retry strategy's attempt counter was already incremented by recordFailureAndEvaluateShouldRetry, so the budget is correctly tracked.

} else if (retryStrategy.recordFailureAndEvaluateShouldRetry(
parsedResponse.getStatus(),
remote->shardHostAndPort,
cbData.response.getErrorLabels())) {
if (!_opCtx) {
// ARM is detached — can't schedule retry on dead SubBaton.
// Leave remote in retriable state; _scheduleGetMores will
// re-send the getMore on reattach.
_signalCurrentEventIfReady(lk);
return;
}
// ... existing retry scheduling via _subBaton->waitUntil() ...

Trade-offs

No backoff delay on deferred retries: The design specifies exponential backoff for SystemOverloadedError retries. When deferred to reattach, the getMore is re-sent immediately. This is acceptable because the client round-trip time (receive batch, process, issue next getMore) provides a natural delay comparable to the backoff parameters (100-400ms).
Retry attempt count is still tracked: recordFailureAndEvaluateShouldRetry increments the counter before the _opCtx check, so the max retry limit and adaptive budget are correctly enforced.

blocks

SERVER-114130 Move the failRateLimiting fail point to the ingress request rate limiter in session workflow

In Progress

Assignee:: Unassigned
Reporter:: Blake Oler
Participants:: Blake Oler
Votes:: 0 Vote for this issue
Watchers:: 1 Start watching this issue

Created:: Apr 08 2026 04:01:59 PM UTC
Updated:: Apr 08 2026 04:03:30 PM UTC

Details

Description

Problem

Design Gap?

Reproduction

Proposed Fix

Trade-offs

Attachments

Issue Links

Activity

People

Dates