AsyncResultsMerger::_handleBatchResponse self-deadlocks reactor thread when SubBaton is dead during retry

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • 9.0 Required
    • Affects Version/s: None
    • Component/s: Query Execution
    • ALL

      Problem

      AsyncResultsMerger::_handleBatchResponse can self-deadlock the reactor thread when a retry is scheduled through a dead SubBaton. This permanently blocks the NetworkInterfaceTL reactor, killing all connection pool maintenance and networking for that executor.

      Root Cause

      In src/mongo/s/query/exec/async_results_merger.cpp lines 1385-1415, _handleBatchResponse is called while holding ARM _mutex (via WithLock). In the retry path, it calls:

      _subBaton
          ->waitUntil(getGlobalServiceContext()->getPreciseClockSource()->now() + delay,
                      _cancellationSource.token())
          .getAsync([weak = weak_from_this(), ...](Status s) {
              auto self = weak.lock();
              if (!self)
                  return;
              stdx::lock_guard<stdx::mutex> lk(self->_mutex);  // DEADLOCK
              ...
          });
      

      If the SubBaton is dead (parent baton detached), SubBaton::waitUntil returns an already-resolved future (kDetached status) at src/mongo/db/baton.cpp line 131. FutureImpl::getAsync then invokes the callback inline on the current thread, which already holds _mutex.

      Impact

      • The reactor thread (NetworkInterfaceTL-TaskExecutorPool-N) is permanently blocked waiting on the ARM's mutex
      • All connection pool timers and networking on that reactor stop
      • Any subsequent getMore that tries to check out the same cursor blocks in reattachToOperationContext waiting on the same ARM mutex, causing the operation to hang until test/operation timeout
      • Observed as an hour-long hang in sharded getMore tests using the failIngressRequestRateLimiting failpoint

      Reproduction

      The deadlock requires:

      • An ARM with an outstanding remote getMore request
      • The parent baton (from the original opCtx) is detached, setting SubBaton::_isDead = true
      • The shard returns a retryable error (e.g. IngressRequestRateLimitExceeded)
      • _handleBatchResponse enters the retry path (lines 1376-1378)
      • _subBaton->waitUntil(...) returns an already-resolved future
      • .getAsync runs the callback inline → self-deadlock

      Stack Traces

      Reactor thread (self-deadlocked):

      #0  __lll_lock_wait ()
      #2  AsyncResultsMerger::_handleBatchResponse(...)::$_0  // .getAsync callback trying to re-acquire _mutex
      #3  AsyncResultsMerger::_handleBatchResponse(WithLock, ...)  // already holds _mutex
      #10 NetworkInterfaceThreadPool::_consumeTasksInline(...)
      #28 asio::detail::scheduler::run(...)
      #29 AsioReactor::run()
      

      Victim getMore thread (blocked on same mutex):

      #0  __lll_lock_wait ()
      #2  AsyncResultsMerger::reattachToOperationContext(...)  // waiting on _mutex
      #4  ClusterCursorManager::checkOutCursor(...)
      #5  ClusterFind::runGetMore(...)
      

      Suggested Fix

      Ensure the .getAsync callback at line 1388 never runs inline while _mutex is held. For example, route it through an executor:

      _subBaton
          ->waitUntil(... + delay, _cancellationSource.token())
          .thenRunOn(_executor)
          .getAsync([weak = weak_from_this(), ...](Status s) { ... });

      Alternatively, check for a dead SubBaton before entering the retry path and treat it as a terminal failure rather than scheduling a retry that will immediately self-deadlock.

            Assignee: Unassigned
            Reporter: Blake Oler
            Votes: 0
            Watchers: 2
