Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 9.0.0-rc1, 8.3.8
Affects Version/s: None
Component/s: None
Labels:
- shardedloadshedding9.0

Assigned Teams:

Query Execution
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v8.3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

While adding sharded rate limiting suites in ~~SERVER-114130~~, I'm seeing occasional crashes in mongos in tests that use getMores. The particular problem seems to be a getMore that has a retry scheduled by the AsyncResultsMerger before the opCtx associated with it is detached, but runs after getting detached. The retry captures a copy of the original RemoteCommandRequest, which holds a raw pointer to the original opCtx, but when the retry runs, the opCtx may have been destructed and both the pointer and any types owned by the opCtx, like its baton, are invalid.

I've seen this manifest a couple ways, like a crash destructing a shared pointer (in this run), but from local runs with ASAN, the root cause seems to be accessing the opCtx's freed memory region via the copied RemoteCommandRequest.

Running the rate limited suites with this AI suggested fix seems to resolve the issue:

diff --git a/src/mongo/s/query/exec/async_results_merger.cpp b/src/mongo/s/query/exec/async_results_merger.cpp
index 32d8978f4f1..a23d729278d 100644
--- a/src/mongo/s/query/exec/async_results_merger.cpp
+++ b/src/mongo/s/query/exec/async_results_merger.cpp
@@ -1404,8 +1404,27 @@ void AsyncResultsMerger::_handleBatchResponse(WithLock lk,
                     return;
                 }
 
+                // The retry captured a copy of the original 'RemoteCommandRequest', which holds a
+                // raw OperationContext pointer. By the time this callback runs, the originating
+                // getMore may have returned and the ARM been detached from its OperationContext
+                // (which may since have been destroyed - e.g. the cursor was checked in between
+                // getMores). Never re-dispatch using the captured opCtx:
+                //  - If detached, leave the remote reschedulable and defer the retry to the next
+                //    '_scheduleGetMores()' after reattach. A failed (e.g. rate-limited) getMore did
+                //    not advance the shard cursor, so this neither loses nor duplicates results.
+                //  - If attached, retarget the request at the OperationContext we are attached to
+                //    now (which may differ from the one the request was originally built with).
+                if (!self->_opCtx) {
+                    remote->outstandingRequest = false;
+                    remote->cbHandle = executor::TaskExecutor::CallbackHandle();
+                    self->_signalCurrentEventIfReady(lk);
+                    return;
+                }
+                auto refreshedRequest = request;
+                refreshedRequest.opCtx = self->_opCtx;
+
                 remote->retryStrategy.recordBackoff(delay);
-                auto status = self->_sendRequestWithRetries(lk, request, remote);
+                auto status = self->_sendRequestWithRetries(lk, refreshedRequest, remote);
                 if (!status.isOK()) {
                     self->_signalCurrentEventIfReady(lk);
                 }

~~SERVER-123666~~ and ~~SERVER-123537~~ fixed similar issues, but my PR for ~~SERVER-114130~~ contains both fixes, so I believe this is a new problem.

is depended on by

SERVER-129844 Enable getMore tests in rate limited suites

Closed

is related to

SERVER-130891 AsyncResultsMerger retry reuses stale opCtx-derived timeout/deadline when re-dispatching after reattach to a different OperationContext

Backlog

SERVER-123537 AsyncResultsMerger::_handleBatchResponse self-deadlocks reactor thread when SubBaton is dead during retry

Closed

SERVER-123666 AsyncResultsMerger retryable error handling doesn't account for detached cursor state, causing deadlock or lost retries

Closed

SERVER-114130 Move the failRateLimiting fail point to the ingress request rate limiter in session workflow

Closed

Assignee:: Mickey Winters
Reporter:: Jack Mulrow
Participants:: Githook User, Jack Mulrow, Mickey Winters
Votes:: 0 Vote for this issue
Watchers:: 9 Start watching this issue

Created:: Jun 10 2026 08:55:13 PM UTC
Updated:: Jul 20 2026 05:01:13 PM UTC
Resolved:: Jul 02 2026 08:10:20 PM UTC

Details

Description

Attachments

Issue Links

Activity

People

Dates

PagerDuty