Rate limited getMore retry re-dispatched after the OperationContext is destroyed crashes

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • 9.0 Required
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Execution
    • ALL
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      While adding sharded rate limiting suites in SERVER-114130, I'm seeing occasional crashes in mongos in tests that use getMores. The particular problem seems to be a getMore that has a retry scheduled by the AsyncResultsMerger before the opCtx associated with it is detached, but runs after getting detached. The retry captures a copy of the original RemoteCommandRequest, which holds a raw pointer to the original opCtx, but when the retry runs, the opCtx may have been destructed and both the pointer and any types owned by the opCtx, like its baton, are invalid.

      I've seen this manifest a couple ways, like a crash destructing a shared pointer (in this run), but from local runs with ASAN, the root cause seems to be accessing the opCtx's freed memory region via the copied RemoteCommandRequest.

      Running the rate limited suites with this AI suggested fix seems to resolve the issue:

      diff --git a/src/mongo/s/query/exec/async_results_merger.cpp b/src/mongo/s/query/exec/async_results_merger.cpp
      index 32d8978f4f1..a23d729278d 100644
      --- a/src/mongo/s/query/exec/async_results_merger.cpp
      +++ b/src/mongo/s/query/exec/async_results_merger.cpp
      @@ -1404,8 +1404,27 @@ void AsyncResultsMerger::_handleBatchResponse(WithLock lk,
                           return;
                       }
       
      +                // The retry captured a copy of the original 'RemoteCommandRequest', which holds a
      +                // raw OperationContext pointer. By the time this callback runs, the originating
      +                // getMore may have returned and the ARM been detached from its OperationContext
      +                // (which may since have been destroyed - e.g. the cursor was checked in between
      +                // getMores). Never re-dispatch using the captured opCtx:
      +                //  - If detached, leave the remote reschedulable and defer the retry to the next
      +                //    '_scheduleGetMores()' after reattach. A failed (e.g. rate-limited) getMore did
      +                //    not advance the shard cursor, so this neither loses nor duplicates results.
      +                //  - If attached, retarget the request at the OperationContext we are attached to
      +                //    now (which may differ from the one the request was originally built with).
      +                if (!self->_opCtx) {
      +                    remote->outstandingRequest = false;
      +                    remote->cbHandle = executor::TaskExecutor::CallbackHandle();
      +                    self->_signalCurrentEventIfReady(lk);
      +                    return;
      +                }
      +                auto refreshedRequest = request;
      +                refreshedRequest.opCtx = self->_opCtx;
      +
                       remote->retryStrategy.recordBackoff(delay);
      -                auto status = self->_sendRequestWithRetries(lk, request, remote);
      +                auto status = self->_sendRequestWithRetries(lk, refreshedRequest, remote);
                       if (!status.isOK()) {
                           self->_signalCurrentEventIfReady(lk);
                       }
      

      SERVER-123666 and SERVER-123537 fixed similar issues, but my PR for SERVER-114130 contains both fixes, so I believe this is a new problem.

            Assignee:
            Mickey Winters
            Reporter:
            Jack Mulrow
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated: