Type: Task
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Cluster Scalability
Backwards Compatibility:
Fully Compatible
Story Points:
2
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Base commit 849369eda5291c6299cfccf9e4d551443f586779

Resharding Hang/Deadlock Bug Analysis

Overview

Three confirmed bugs in the MongoDB resharding state machines where SharedPromise objects are not guaranteed to be fulfilled on all exit paths. Each affected promise is returned as a raw future (no withCancellation wrapper) from a public getter. If the promise is never fulfilled, any caller of the corresponding await* method hangs indefinitely.

The common fix for all three is to add the missing ensureFulfilledPromise calls to _runMandatoryCleanup's error block; Bug 3 also needs one in the onError handler of _runUntilStrictConsistencyOrErrored.

Bug 1: Donor `_inBlockingWritesOrError` Never Fulfilled on Abort or Stepdown

File: src/mongo/db/s/resharding/resharding_donor_service.cpp

Root Cause

The promise _inBlockingWritesOrError is fulfilled in exactly one place: the onCompletion handler of _runUntilBlockingWritesOrErrored. That handler has an early return that skips fulfillment when the operation is aborted or the primary steps down:

// ~line 453
.onCompletion([this, executor](Status status) {
    if (_cancelState->isAbortedOrSteppingDown()) {
        return ExecutorFuture<void>(**executor, status);  // EARLY RETURN — no promise fulfillment
    }
    {
        std::lock_guard<std::mutex> lk(_mutex);
        ensureFulfilledPromise(lk, _inBlockingWritesOrError);
    }
    return ExecutorFuture<void>(**executor, status);
})

_runMandatoryCleanup (the cleanup step that always runs) fulfills the other promises on error but omits _inBlockingWritesOrError:

// ~line 605
if (!status.isOK()) {
    std::lock_guard<std::mutex> lk(_mutex);
    ensureFulfilledPromise(lk, _inDonatingOplogEntries, statusForPromise);
    ensureFulfilledPromise(lk, _changeStreamsMonitorStarted, statusForPromise);
    ensureFulfilledPromise(lk, _changeStreamsMonitorCompleted, statusForPromise);
    ensureFulfilledPromise(lk, _critSecWasAcquired, statusForPromise);
    ensureFulfilledPromise(lk, _critSecWasPromoted, statusForPromise);
    ensureFulfilledPromise(lk, _completionPromise, statusForPromise);
    // BUG: _inBlockingWritesOrError is missing
}

abort() (~line 1518) also does not fulfill it.
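
To make the failure mode concrete, here is a minimal single-threaded C++ sketch of the control flow described above. ToyPromise, ToyDonor, and waiterWouldHang are hypothetical stand-ins invented for illustration, not MongoDB types. The real SharedPromise blocks its waiters, whereas this model only exposes a readiness flag.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for SharedPromise<void>.
// An empty string models Status::OK(); any other string models an error.
struct ToyPromise {
    std::optional<std::string> result;
    bool isReady() const { return result.has_value(); }
    void set(std::string status) {
        assert(!isReady());  // a promise may be fulfilled only once
        result = std::move(status);
    }
};

// Toy donor that mirrors the shape of the two code excerpts above.
struct ToyDonor {
    ToyPromise inBlockingWritesOrError;
    ToyPromise completionPromise;
    bool abortedOrSteppingDown = false;

    // Shape of the onCompletion handler: the early return skips fulfillment.
    void onCompletion() {
        if (abortedOrSteppingDown) return;
        if (!inBlockingWritesOrError.isReady()) inBlockingWritesOrError.set("");
    }

    // Shape of the cleanup error block before the fix: it fulfills other
    // promises (one shown here) but never touches inBlockingWritesOrError.
    void runMandatoryCleanup(const std::string& status) {
        if (status.empty()) return;
        if (!completionPromise.isReady()) completionPromise.set(status);
    }
};

// A waiter on the raw future stays blocked for as long as the promise is not ready.
bool waiterWouldHang(const ToyDonor& donor) {
    return !donor.inBlockingWritesOrError.isReady();
}
```

In this model, setting abortedOrSteppingDown and then calling onCompletion() followed by runMandatoryCleanup("aborted") leaves completionPromise ready but inBlockingWritesOrError pending, which is the hang described in this ticket.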

Impact

Any caller of awaitInBlockingWritesOrError() hangs indefinitely when resharding is aborted or the primary steps down.

Fix

Add to _runMandatoryCleanup's error block:

ensureFulfilledPromise(lk, _inBlockingWritesOrError, statusForPromise);
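
Adding the call to cleanup is only safe if fulfilling an already-fulfilled promise is a no-op, so that the first status wins. The sketch below assumes ensureFulfilledPromise behaves that way, as its name suggests. ToyPromise, ToyDonor, and ensureFulfilled are hypothetical stand-ins, not the real helper or types.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for SharedPromise<void>.
// An empty string models Status::OK(); any other string models an error.
struct ToyPromise {
    std::optional<std::string> result;
    bool isReady() const { return result.has_value(); }
    void set(std::string status) {
        assert(!isReady());  // a promise may be fulfilled only once
        result = std::move(status);
    }
};

// Assumed behavior of ensureFulfilledPromise: do nothing if already fulfilled.
void ensureFulfilled(ToyPromise& promise, std::string status = std::string()) {
    if (!promise.isReady()) promise.set(std::move(status));
}

struct ToyDonor {
    ToyPromise inBlockingWritesOrError;
    bool abortedOrSteppingDown = false;

    // Unchanged handler: still returns early on abort or stepdown.
    void onCompletion() {
        if (abortedOrSteppingDown) return;
        ensureFulfilled(inBlockingWritesOrError);
    }

    // Cleanup error block with the proposed line added.
    void runMandatoryCleanup(const std::string& status) {
        if (status.empty()) return;
        ensureFulfilled(inBlockingWritesOrError, status);
    }
};
```

With the extra call, the abort path ends with the promise holding the cleanup status. A run that already fulfilled the promise successfully keeps that earlier result even if cleanup later sees an error.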

Bug 2: Recipient `_inStrictConsistencyOrError` Never Fulfilled on Abort or Stepdown

File: src/mongo/db/s/resharding/resharding_recipient_service.cpp

Root Cause

The onCompletion handler of _runUntilStrictConsistencyOrErrored early-returns without fulfilling the promise when isAbortedOrSteppingDown():

// ~line 493
.onCompletion([this, executor](Status status) {
    if (_cancelState->isAbortedOrSteppingDown()) {
        return ExecutorFuture<void>(**executor, status);  // EARLY RETURN
    }
    {
        std::lock_guard<std::mutex> lk(_mutex);
        ensureFulfilledPromise(lk, _inStrictConsistencyOrError);
    }
    return ExecutorFuture<void>(**executor, status);
})

_runMandatoryCleanup's error block omits it:

// ~line 647
if (!outerStatus.isOK()) {
    std::lock_guard<std::mutex> lk(_mutex);
    ensureFulfilledPromise(lk, _changeStreamsMonitorStarted, statusForPromise);
    ensureFulfilledPromise(lk, _changeStreamsMonitorCompleted, statusForPromise);
    ensureFulfilledPromise(lk, _completionPromise, statusForPromise);
    // BUG: _inStrictConsistencyOrError is missing
}

_fulfillPromisesOnStepup does handle _inStrictConsistencyOrError for the stepup-recovery path, but that only executes when stepping back up as a primary with a pre-existing state document — it does not cover the abort/error case.

Impact

Any caller of awaitInStrictConsistencyOrError() hangs indefinitely when resharding is aborted or the primary steps down.

Fix

Add to _runMandatoryCleanup's error block:

ensureFulfilledPromise(lk, _inStrictConsistencyOrError, statusForPromise);

Bug 3: Recipient `_inApplyingOrError` Never Fulfilled on Abort or Pre-`kApplying` Error

File: src/mongo/db/s/resharding/resharding_recipient_service.cpp

Root Cause

_inApplyingOrError is fulfilled in the success path by _buildIndexThenTransitionToApplying (~line 1425) and in _fulfillPromisesOnStepup for stepup recovery. Two failure paths miss it:

Path A — error before kApplying is reached:

The onError handler of _runUntilStrictConsistencyOrErrored (~line 461) fulfills _changeStreamsMonitorStarted, _changeStreamsMonitorCompleted, and _transitionedToCreateCollection when !isAbortedOrSteppingDown(), but does not fulfill _inApplyingOrError.

Path B — abort or stepdown:

_buildIndexThenTransitionToApplying's onCompletion (~line 1418) early-returns when isAbortedOrSteppingDown(), skipping the fulfillment. _runMandatoryCleanup's error block omits it entirely (see Bug 2 snippet above — both promises are missing from the same block).

Impact

If resharding fails during cloning or index building (before ever entering kApplying), or if it is aborted at any point before index building completes, any waiter on awaitInApplyingOrError() hangs forever.

Fix

Add to the onError handler's non-abort branch:

ensureFulfilledPromise(lk, _inApplyingOrError, statusForPromise);

And add to _runMandatoryCleanup's error block:

ensureFulfilledPromise(lk, _inApplyingOrError, statusForPromise);
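
The two additions can be checked against each exit path with a small model. The following is a rough sketch, not the recipient's actual control flow: Exit, ToyRecipient, fixApplied, and readyAfter are hypothetical names, and the real state machine has many more states than the three exits modeled here.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for SharedPromise<void>; empty string models success.
struct ToyPromise {
    std::optional<std::string> result;
    bool isReady() const { return result.has_value(); }
    void set(std::string status) {
        assert(!isReady());  // a promise may be fulfilled only once
        result = std::move(status);
    }
};

// Assumed behavior of ensureFulfilledPromise: do nothing if already fulfilled.
void ensureFulfilled(ToyPromise& promise, std::string status = std::string()) {
    if (!promise.isReady()) promise.set(std::move(status));
}

// The three ways the modeled run can end.
enum class Exit { Success, ErrorBeforeApplying, AbortOrStepdown };

struct ToyRecipient {
    ToyPromise inApplyingOrError;
    bool fixApplied;  // whether both proposed calls are present

    explicit ToyRecipient(bool applyFix) : fixApplied(applyFix) {}

    void run(Exit exit) {
        if (exit == Exit::Success) {
            // Success path: the index build completes and fulfills the promise.
            ensureFulfilled(inApplyingOrError);
            return;
        }
        const std::string status = "error";
        // Path A: onError handler, taken only when not aborted or stepping down.
        if (exit == Exit::ErrorBeforeApplying && fixApplied) {
            ensureFulfilled(inApplyingOrError, status);
        }
        // Path B: cleanup error block, reached on every failing exit.
        if (fixApplied) {
            ensureFulfilled(inApplyingOrError, status);
        }
    }
};

// True if a waiter on the promise would be released after the given exit.
bool readyAfter(Exit exit, bool applyFix) {
    ToyRecipient recipient(applyFix);
    recipient.run(exit);
    return recipient.inApplyingOrError.isReady();
}
```

Without the fix, only Exit::Success leaves the promise ready in this model. With it, all three exits do, and the second call on the pre-kApplying error path is a harmless no-op.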

Summary

Bug	Promise	File	Missing From
1	`_inBlockingWritesOrError`	`resharding_donor_service.cpp`	`_runMandatoryCleanup` error block
2	`_inStrictConsistencyOrError`	`resharding_recipient_service.cpp`	`_runMandatoryCleanup` error block
3	`_inApplyingOrError`	`resharding_recipient_service.cpp`	`_runMandatoryCleanup` error block AND `onError` non-abort branch

Why These Are Dangerous

All three promises are returned as raw futures from public getters (e.g., awaitInBlockingWritesOrError()) with no withCancellation wrapper. This means callers receive no automatic unblocking signal when the operation aborts. The only way a waiter can be unblocked is if the promise is explicitly fulfilled — and on the abort/error paths, it is not.

The correct model (already used elsewhere in the codebase) is either:

1. Wrap the returned future with withCancellation(future, abortOrStepdownToken), or
2. Guarantee all promises are fulfilled on every exit path, including in _runMandatoryCleanup.

The _runMandatoryCleanup approach is already the established pattern for these state machines. All three bugs are fixed by consistently applying it to the missing promises.
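
For comparison, the withCancellation alternative can be modeled as a wrapper that resolves as soon as either the promise is fulfilled or the token is canceled. This is a simplified polling sketch under stated assumptions: ToyPromise, ToyToken, and pollWithCancellation are hypothetical, the "canceled" string stands in for whatever error the real wrapper reports, and the real utility returns a future rather than being polled.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for SharedPromise<void>; empty string models success.
struct ToyPromise {
    std::optional<std::string> result;
    bool isReady() const { return result.has_value(); }
    void set(std::string status) {
        assert(!isReady());  // a promise may be fulfilled only once
        result = std::move(status);
    }
};

// Hypothetical stand-in for an abort-or-stepdown cancellation token.
struct ToyToken {
    bool canceled = false;
};

// Returns the wrapped result if one is available, or std::nullopt while the
// caller would still be waiting. A fulfilled promise takes precedence over
// cancellation; otherwise cancellation releases the waiter with an error.
std::optional<std::string> pollWithCancellation(const ToyPromise& promise,
                                                const ToyToken& token) {
    if (promise.isReady()) return promise.result;
    if (token.canceled) return std::string("canceled");
    return std::nullopt;
}
```

In this model a waiter is released on abort even when the promise itself is never fulfilled, which is the property the raw getters lack.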
