Investigate resharding hang issue reported by Claude


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Cluster Scalability
    • 2

      Base commit 849369eda5291c6299cfccf9e4d551443f586779

      Resharding Hang/Deadlock Bug Analysis

      Overview

      Three confirmed bugs in the MongoDB resharding state machines where SharedPromise objects are not guaranteed to be fulfilled on all exit paths. Each affected promise is returned as a raw future (no withCancellation wrapper) from a public getter. If the promise is never fulfilled, any caller of the corresponding await* method hangs indefinitely.

      The common fix pattern for all three: add the missing ensureFulfilledPromise calls to _runMandatoryCleanup's error block. Bug 3 additionally needs a call in the onError handler of _runUntilStrictConsistencyOrErrored.
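
      A minimal, self-contained sketch of this failure mode, using std::promise and std::shared_future as stand-ins for SharedPromise and its future. MilestoneModel, awaitMilestone, waiterStillBlocked, and demonstrateHang are hypothetical names for illustration, not MongoDB APIs, and a 50 ms timeout stands in for an indefinite hang:

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical stand-in for a state machine that owns one milestone promise.
// std::promise / std::shared_future play the role of SharedPromise here.
class MilestoneModel {
public:
    MilestoneModel() : _future(_promise.get_future().share()) {}

    // Raw future handed to callers, with no cancellation wrapper.
    std::shared_future<void> awaitMilestone() const {
        return _future;
    }

    // Models the completion handler: the early return skips fulfillment.
    void onCompletion(bool abortedOrSteppingDown) {
        if (abortedOrSteppingDown) {
            return;  // promise is left unfulfilled
        }
        _promise.set_value();
    }

private:
    std::promise<void> _promise;       // declared first, so it is constructed first
    std::shared_future<void> _future;
};

// Returns true if a waiter is still blocked after a short timeout.
bool waiterStillBlocked(bool abortedOrSteppingDown) {
    MilestoneModel model;
    auto fut = model.awaitMilestone();
    model.onCompletion(abortedOrSteppingDown);
    return fut.wait_for(std::chrono::milliseconds(50)) != std::future_status::ready;
}

// Usage: the normal path releases the waiter, the abort/stepdown path does not.
void demonstrateHang() {
    assert(!waiterStillBlocked(false));
    assert(waiterStillBlocked(true));
}
```

      With the early return taken, nothing else in this model fulfills the promise, so the waiter is released only by the timeout.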


      Bug 1: Donor `_inBlockingWritesOrError` Never Fulfilled on Abort or Stepdown

      File: src/mongo/db/s/resharding/resharding_donor_service.cpp

      Root Cause

      The promise _inBlockingWritesOrError is fulfilled in exactly one place: the onCompletion handler of _runUntilBlockingWritesOrErrored. That handler has an early return that skips fulfillment when the operation is aborted or the primary steps down:

      // ~line 453
      .onCompletion([this, executor](Status status) {
          if (_cancelState->isAbortedOrSteppingDown()) {
              return ExecutorFuture<void>(**executor, status);  // EARLY RETURN — no promise fulfillment
          }
          {
              std::lock_guard<std::mutex> lk(_mutex);
              ensureFulfilledPromise(lk, _inBlockingWritesOrError);
          }
          return ExecutorFuture<void>(**executor, status);
      })
      

      _runMandatoryCleanup (the "always runs" cleanup) fulfills the other promises on error but omits _inBlockingWritesOrError:

      // ~line 605
      if (!status.isOK()) {
          std::lock_guard<std::mutex> lk(_mutex);
          ensureFulfilledPromise(lk, _inDonatingOplogEntries, statusForPromise);
          ensureFulfilledPromise(lk, _changeStreamsMonitorStarted, statusForPromise);
          ensureFulfilledPromise(lk, _changeStreamsMonitorCompleted, statusForPromise);
          ensureFulfilledPromise(lk, _critSecWasAcquired, statusForPromise);
          ensureFulfilledPromise(lk, _critSecWasPromoted, statusForPromise);
          ensureFulfilledPromise(lk, _completionPromise, statusForPromise);
          // BUG: _inBlockingWritesOrError is missing
      }
      

      abort() (~line 1518) also does not fulfill it.

      Impact

      Any caller of awaitInBlockingWritesOrError() hangs indefinitely when resharding is aborted or the primary steps down.

      Fix

      Add to _runMandatoryCleanup's error block:

      ensureFulfilledPromise(lk, _inBlockingWritesOrError, statusForPromise);
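
      The fix relies on the cleanup call being safe to make even when the promise was already fulfilled on the normal path. A sketch of that idempotent behavior, under the assumption that ensureFulfilledPromise only acts on an unfulfilled promise (the ticket does not show its implementation). Milestone, ensureFulfilled, DonorModel, and the status strings are hypothetical stand-ins, not the real types:

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical one-shot milestone that keeps the first outcome it is given.
struct Milestone {
    bool fulfilled = false;
    std::string outcome;
};

// Idempotent "fulfill if not already fulfilled" helper. Taking the lock_guard
// by reference documents that the caller must already hold the mutex.
void ensureFulfilled(const std::lock_guard<std::mutex>&,
                     Milestone& m,
                     const std::string& outcome) {
    if (!m.fulfilled) {
        m.fulfilled = true;
        m.outcome = outcome;
    }
}

// Hypothetical model of the donor's two relevant code paths.
struct DonorModel {
    std::mutex mutex;
    Milestone inBlockingWritesOrError;
    Milestone completion;

    void onCompletion(bool abortedOrSteppingDown) {
        if (abortedOrSteppingDown) {
            return;  // early return, as in the snippet above
        }
        std::lock_guard<std::mutex> lk(mutex);
        ensureFulfilled(lk, inBlockingWritesOrError, "OK");
    }

    void runMandatoryCleanup(bool ok, const std::string& statusForPromise) {
        if (!ok) {
            std::lock_guard<std::mutex> lk(mutex);
            ensureFulfilled(lk, completion, statusForPromise);
            // Proposed addition: cover the promise the early return skipped.
            ensureFulfilled(lk, inBlockingWritesOrError, statusForPromise);
        }
    }
};

// Usage: abort path gets the error status; an earlier "OK" is never overwritten.
void demonstrateCleanup() {
    DonorModel aborted;
    aborted.onCompletion(true);
    aborted.runMandatoryCleanup(false, "Interrupted");
    assert(aborted.inBlockingWritesOrError.outcome == "Interrupted");

    DonorModel lateError;
    lateError.onCompletion(false);
    lateError.runMandatoryCleanup(false, "Interrupted");
    assert(lateError.inBlockingWritesOrError.outcome == "OK");
}
```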
      

      Bug 2: Recipient `_inStrictConsistencyOrError` Never Fulfilled on Abort or Stepdown

      File: src/mongo/db/s/resharding/resharding_recipient_service.cpp

      Root Cause

      The onCompletion handler of _runUntilStrictConsistencyOrErrored early-returns without fulfilling the promise when isAbortedOrSteppingDown():

      // ~line 493
      .onCompletion([this, executor](Status status) {
          if (_cancelState->isAbortedOrSteppingDown()) {
              return ExecutorFuture<void>(**executor, status);  // EARLY RETURN
          }
          {
              std::lock_guard<std::mutex> lk(_mutex);
              ensureFulfilledPromise(lk, _inStrictConsistencyOrError);
          }
          return ExecutorFuture<void>(**executor, status);
      })
      

      _runMandatoryCleanup's error block omits it:

      // ~line 647
      if (!outerStatus.isOK()) {
          std::lock_guard<std::mutex> lk(_mutex);
          ensureFulfilledPromise(lk, _changeStreamsMonitorStarted, statusForPromise);
          ensureFulfilledPromise(lk, _changeStreamsMonitorCompleted, statusForPromise);
          ensureFulfilledPromise(lk, _completionPromise, statusForPromise);
          // BUG: _inStrictConsistencyOrError is missing
      }
      

      _fulfillPromisesOnStepup does handle _inStrictConsistencyOrError for the stepup-recovery path, but that only executes when stepping back up as a primary with a pre-existing state document — it does not cover the abort/error case.

      Impact

      Any caller of awaitInStrictConsistencyOrError() hangs indefinitely when resharding is aborted or the primary steps down.

      Fix

      Add to _runMandatoryCleanup's error block:

      ensureFulfilledPromise(lk, _inStrictConsistencyOrError, statusForPromise);
      

      Bug 3: Recipient `_inApplyingOrError` Never Fulfilled on Abort or Pre-`kApplying` Error

      File: src/mongo/db/s/resharding/resharding_recipient_service.cpp

      Root Cause

      _inApplyingOrError is fulfilled in the success path by _buildIndexThenTransitionToApplying (~line 1425) and in _fulfillPromisesOnStepup for stepup recovery. Two failure paths miss it:

      Path A — error before kApplying is reached:

      The onError handler of _runUntilStrictConsistencyOrErrored (~line 461) fulfills _changeStreamsMonitorStarted, _changeStreamsMonitorCompleted, and _transitionedToCreateCollection when !isAbortedOrSteppingDown(), but does not fulfill _inApplyingOrError.

      Path B — abort or stepdown:

      _buildIndexThenTransitionToApplying's onCompletion (~line 1418) early-returns when isAbortedOrSteppingDown(), skipping the fulfillment. _runMandatoryCleanup's error block omits it entirely (see Bug 2 snippet above — both promises are missing from the same block).

      Impact

      If resharding fails during cloning or index building (before ever entering kApplying), or if it is aborted at any point before index building completes, any waiter on awaitInApplyingOrError() hangs forever.

      Fix

      Add to the onError handler's non-abort branch:

      ensureFulfilledPromise(lk, _inApplyingOrError, statusForPromise);
      

      And add to _runMandatoryCleanup's error block:

      ensureFulfilledPromise(lk, _inApplyingOrError, statusForPromise);
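
      One way to check that both additions are needed is to enumerate the exit paths and confirm the promise is fulfilled on each. The sketch below is a simplified model, not the recipient service itself: ExitPath, Slot, RecipientModel, and unfulfilledPaths are hypothetical names, the status strings are placeholders, and locking is omitted for brevity.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class ExitPath { Success, ErrorBeforeApplying, AbortOrStepdown };

// Hypothetical one-shot slot standing in for _inApplyingOrError.
struct Slot {
    bool fulfilled = false;
    std::string outcome;
};

void ensureFulfilled(Slot& s, const std::string& outcome) {
    if (!s.fulfilled) {
        s.fulfilled = true;
        s.outcome = outcome;
    }
}

// Models the three ways the recipient can leave the pre-applying phase.
struct RecipientModel {
    bool applyFix;
    Slot inApplyingOrError;

    explicit RecipientModel(bool fix) : applyFix(fix) {}

    void run(ExitPath path) {
        switch (path) {
            case ExitPath::Success:
                // Index build finished and the promise is fulfilled normally.
                ensureFulfilled(inApplyingOrError, "OK");
                break;
            case ExitPath::ErrorBeforeApplying:
                // Path A: onError handler, non-abort branch.
                if (applyFix) {
                    ensureFulfilled(inApplyingOrError, "Error");
                }
                mandatoryCleanup("Error");
                break;
            case ExitPath::AbortOrStepdown:
                // Path B: the completion handler early-returns.
                mandatoryCleanup("Aborted");
                break;
        }
    }

    void mandatoryCleanup(const std::string& statusForPromise) {
        if (applyFix) {
            ensureFulfilled(inApplyingOrError, statusForPromise);
        }
    }
};

// Returns every exit path that leaves the promise unfulfilled.
std::vector<ExitPath> unfulfilledPaths(bool applyFix) {
    std::vector<ExitPath> missing;
    for (ExitPath p : {ExitPath::Success,
                       ExitPath::ErrorBeforeApplying,
                       ExitPath::AbortOrStepdown}) {
        RecipientModel r(applyFix);
        r.run(p);
        if (!r.inApplyingOrError.fulfilled) {
            missing.push_back(p);
        }
    }
    return missing;
}
```

      In this model, unfulfilledPaths(false) reports the error and abort paths, and unfulfilledPaths(true) reports none.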
      

      Summary

      | Bug | Promise | File | Missing From |
      |-----|---------|------|--------------|
      | 1 | _inBlockingWritesOrError | resharding_donor_service.cpp | _runMandatoryCleanup error block |
      | 2 | _inStrictConsistencyOrError | resharding_recipient_service.cpp | _runMandatoryCleanup error block |
      | 3 | _inApplyingOrError | resharding_recipient_service.cpp | _runMandatoryCleanup error block AND onError non-abort branch |

      Why These Are Dangerous

      All three promises are returned as raw futures from public getters (e.g., awaitInBlockingWritesOrError()) with no withCancellation wrapper. This means callers receive no automatic unblocking signal when the operation aborts. The only way a waiter can be unblocked is if the promise is explicitly fulfilled — and on the abort/error paths, it is not.

      The correct model (already used elsewhere in the codebase) is either:

      • Wrap the returned future with withCancellation(future, abortOrStepdownToken), or
      • Guarantee all promises are fulfilled on every exit path, including in _runMandatoryCleanup.

      The _runMandatoryCleanup approach is already the established pattern for these state machines. All three bugs are fixed by consistently applying it to the missing promises.
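
      For comparison, the idea behind the first option (a waiter that wakes on either fulfillment or cancellation) can be modeled in standard C++. This is only an illustration of the concept and makes no claim about how withCancellation is implemented. CancellableMilestone, WaitResult, and demonstrateCancellation are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class WaitResult { Fulfilled, Cancelled, TimedOut };

// Hypothetical model: a milestone and a cancellation flag share one condition
// variable, so a waiter wakes on whichever happens first.
class CancellableMilestone {
public:
    void fulfill() { set(_fulfilled); }
    void cancel() { set(_cancelled); }

    WaitResult waitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        bool woke = _cv.wait_for(lk, timeout, [this] {
            return _fulfilled || _cancelled;
        });
        if (!woke) {
            return WaitResult::TimedOut;
        }
        return _fulfilled ? WaitResult::Fulfilled : WaitResult::Cancelled;
    }

private:
    void set(bool& flag) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            flag = true;
        }
        _cv.notify_all();
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    bool _fulfilled = false;
    bool _cancelled = false;
};

// Usage: an abort releases the waiter even though fulfill() was never called.
void demonstrateCancellation() {
    CancellableMilestone aborted;
    aborted.cancel();
    assert(aborted.waitFor(std::chrono::milliseconds(50)) == WaitResult::Cancelled);

    CancellableMilestone done;
    done.fulfill();
    assert(done.waitFor(std::chrono::milliseconds(50)) == WaitResult::Fulfilled);
}
```

      The trade-off in this model is that every getter must remember to wrap its future, whereas fulfilling in cleanup fixes all waiters in one place.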

            Assignee:
            Abdul Qadeer
            Reporter:
            Randolph Tan