AsyncResultsMerger::_handleBatchResponse self-deadlocks reactor thread when SubBaton is dead during retry

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • 9.0 Required
    • Affects Version/s: None
    • Component/s: Query Execution
    • ALL

      Problem

      AsyncResultsMerger::_handleBatchResponse can self-deadlock the reactor thread when a retry is scheduled through a dead SubBaton. This permanently blocks the NetworkInterfaceTL reactor, killing all connection pool maintenance and networking for that executor.

      Root Cause

      In src/mongo/s/query/exec/async_results_merger.cpp lines 1385-1415, _handleBatchResponse is called while holding ARM _mutex (via WithLock). In the retry path, it calls:

      _subBaton
          ->waitUntil(getGlobalServiceContext()->getPreciseClockSource()->now() + delay,
                      _cancellationSource.token())
          .getAsync([weak = weak_from_this(), ...](Status s) {
              auto self = weak.lock();
              if (!self)
                  return;
              stdx::lock_guard<stdx::mutex> lk(self->_mutex);  // DEADLOCK
              ...
          });
      

      If the SubBaton is dead (parent baton detached), SubBaton::waitUntil returns an already-resolved future (kDetached status) at src/mongo/db/baton.cpp line 131. FutureImpl::getAsync then invokes the callback inline on the current thread, which already holds _mutex.

      Impact

      • The reactor thread (NetworkInterfaceTL-TaskExecutorPool-N) is permanently blocked waiting on the ARM's mutex
      • All connection pool timers and networking on that reactor stop
      • Any subsequent getMore that tries to check out the same cursor blocks in reattachToOperationContext waiting on the same ARM mutex, causing the operation to hang until test/operation timeout
      • Observed as an hour-long hang in sharded getMore tests using the failIngressRequestRateLimiting failpoint

      Reproduction

      The deadlock requires:

      • An ARM with an outstanding remote getMore request
      • The parent baton (from the original opCtx) is detached, setting SubBaton::_isDead = true
      • The shard returns a retryable error (e.g. IngressRequestRateLimitExceeded)
      • _handleBatchResponse enters the retry path (lines 1376-1378)
      • _subBaton->waitUntil(...) returns an already-resolved future
      • .getAsync runs the callback inline → self-deadlock

      Stack Traces

      Reactor thread (self-deadlocked):

      #0  __lll_lock_wait ()
      #2  AsyncResultsMerger::_handleBatchResponse(...)::$_0  // .getAsync callback trying to re-acquire _mutex
      #3  AsyncResultsMerger::_handleBatchResponse(WithLock, ...)  // already holds _mutex
      #10 NetworkInterfaceThreadPool::_consumeTasksInline(...)
      #28 asio::detail::scheduler::run(...)
      #29 AsioReactor::run()
      

      Victim getMore thread (blocked on same mutex):

      #0  __lll_lock_wait ()
      #2  AsyncResultsMerger::reattachToOperationContext(...)  // waiting on _mutex
      #4  ClusterCursorManager::checkOutCursor(...)
      #5  ClusterFind::runGetMore(...)
      

      Suggested Fix

      Ensure the .getAsync callback at line 1388 never runs inline while _mutex is held. For example, route it through an executor:

      _subBaton
          ->waitUntil(... + delay, _cancellationSource.token())
          .thenRunOn(_executor)
          .getAsync([weak = weak_from_this(), ...](Status s) { ... });

      Alternatively, check for a dead SubBaton before entering the retry path and treat it as a terminal failure rather than scheduling a retry that will immediately self-deadlock.

            Assignee: Unassigned
            Reporter: Blake Oler
            Votes: 0
            Watchers: 2
