Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- repl-shortlist

Assigned Teams:

Replication
Operating System:
ALL
Sprint:
Repl 2026-06-22, Repl 2026-06-22, Repl 2026-07-06, Repl 2026-07-20, Repl 2026-08-03
Case:
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

A resharding recipient node can permanently deadlock during step-up: it wins the election and reports PRIMARY, but never becomes a writable primary. The node logs 8025900 ("onStepUpComplete stepping up all services") but never 21331 ("Transition to primary complete"), and stays in this state until the process is killed. Reproduced on v8.0 and master; observed in production in HELP-93835 (v8.0.19).

How the deadlock forms

It takes two failovers of the same node while the resharding recipient is in the building-index state:

1. A resharding recipient instance starts the temp-collection index build (IndexBuildsCoordinator::startIndexBuild() call from resharding_recipient_service.cpp) and waits on its completion future. A stepdown interrupts the instance cleanly, but the two-phase index build survives by design.
2. The node steps up again. PrimaryOnlyService rebuilds the ReshardingRecipientService instance from the persisted state document; it re-runs _buildIndexThenTransitionToApplying, and startIndexBuild now returns IndexBuildAlreadyInProgress (the build from step 1 is still registered). The handler for that case calls the 2-argument IndexBuildsCoordinator::awaitNoIndexBuildInProgressForCollection (resharding_recipient_service.cpp), whose wait is a bare stdx::condition_variable::wait that accepts an opCtx and then ignores it (active_index_builds.cpp). It also runs synchronously inside the immediately-invoked lambda, before future_util::withCancellation wraps anything, so neither the opCtx kill nor the abort token can unwind it.
3. The node flaps once more. The stepdown interrupt (opCtx markKilled, POS registerOpCtx kill, and RstlKillOpThread all fire) is silently swallowed by the bare wait. When the node then wins the next election, the step-up deadlocks:

onStepUp (holds RSTL-X)  --joins-->  parked recipient instance
parked instance          --waits-->  index build to unregister (commit/abort)
commitIndexBuild         --needs-->  writable primary
writable primary         --needs-->  onStepUp to finish

The cycle is permanent: nothing can interrupt the parked instance, and the build cannot commit or abort while the node is not writable. In HELP-93835 it ended only when the automation SIGTERMed the node, which fasserted (7152000) after the 30s RSTL shutdown timeout.
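The four wait-for edges above can be checked mechanically. The following is a minimal sketch, not server code: WaitForGraph, hasWaitCycle, stepUpWaitGraph and checkStepUpWaitGraph are hypothetical names introduced for illustration, and the model assumes each party waits on exactly one other party.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model: each party waits on exactly one other party.
using WaitForGraph = std::map<std::string, std::string>;

// Follows wait-for edges from `start`; returns true if a party is revisited.
bool hasWaitCycle(const WaitForGraph& graph, const std::string& start) {
    std::set<std::string> seen;
    std::string current = start;
    while (true) {
        if (!seen.insert(current).second) {
            return true;  // already visited: the chain loops back on itself
        }
        auto it = graph.find(current);
        if (it == graph.end()) {
            return false;  // this party waits on nothing, so the chain ends
        }
        current = it->second;
    }
}

// The four edges from the ticket's diagram, in the same order.
WaitForGraph stepUpWaitGraph() {
    return {
        {"onStepUp", "parked recipient instance"},
        {"parked recipient instance", "index build unregister"},
        {"index build unregister", "writable primary"},
        {"writable primary", "onStepUp"},
    };
}

// Checks the graph as described, then with the parked instance's edge removed.
void checkStepUpWaitGraph() {
    WaitForGraph graph = stepUpWaitGraph();
    assert(hasWaitCycle(graph, "onStepUp"));

    // If the parked instance could be interrupted, it would stop waiting on
    // the index build, which removes its outgoing edge and breaks the cycle.
    graph.erase("parked recipient instance");
    assert(!hasWaitCycle(graph, "onStepUp"));
}
```

In this model, removing the parked instance's outgoing edge is the only change needed to break the cycle, which corresponds to making its wait interruptible.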

Fix direction

Make the wait interruptible: the 3-argument overload of awaitNoIndexBuildInProgressForCollection already uses opCtx->waitForConditionOrInterrupt, so the killed opCtx throws at stepdown and the instance unwinds like any other interrupted phase (verified: with that one-line change the regression test goes green). Both recipient call sites (_buildIndexThenTransitionToApplying and the abort/cleanup path) use the vulnerable 2-argument overload, on v8.0 and master alike.
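The difference between the two overloads can be illustrated with standard-library primitives. This is a rough sketch under stated assumptions, not the server implementation: ToyOpCtx, Interrupted, waitOrInterrupt, killOp and interruptUnblocksWaiter are hypothetical stand-ins, and the real OperationContext and waitForConditionOrInterrupt have different signatures and semantics.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for an operation context (not MongoDB's OperationContext).
struct ToyOpCtx {
    bool killed = false;  // guarded by the same mutex the waiter holds
};

struct Interrupted : std::runtime_error {
    Interrupted() : std::runtime_error("operation interrupted") {}
};

// Interruptible wait: returns once done() is true, throws if the op is killed
// first. A bare cv.wait(lk, done) never looks at opCtx.killed, so a kill
// cannot wake it for good.
template <typename Pred>
void waitOrInterrupt(ToyOpCtx& opCtx, std::condition_variable& cv,
                     std::unique_lock<std::mutex>& lk, Pred done) {
    assert(lk.owns_lock());
    cv.wait(lk, [&] { return done() || opCtx.killed; });
    if (!done()) {
        throw Interrupted();
    }
}

// The kill sets the flag under the waiter's mutex and notifies the waiter's cv.
void killOp(ToyOpCtx& opCtx, std::mutex& m, std::condition_variable& cv) {
    {
        std::lock_guard<std::mutex> lk(m);
        opCtx.killed = true;
    }
    cv.notify_all();
}

// Demo: the awaited condition never becomes true, yet the waiter unwinds.
bool interruptUnblocksWaiter() {
    std::mutex m;
    std::condition_variable cv;
    ToyOpCtx opCtx;
    bool buildUnregistered = false;  // never set, like the stuck index build
    bool unwound = false;

    std::thread waiter([&] {
        std::unique_lock<std::mutex> lk(m);
        try {
            waitOrInterrupt(opCtx, cv, lk, [&] { return buildUnregistered; });
        } catch (const Interrupted&) {
            unwound = true;
        }
    });

    killOp(opCtx, m, cv);  // stands in for the stepdown kill
    waiter.join();
    return unwound;
}
```

Calling interruptUnblocksWaiter() returns true because the kill flag is part of the wait predicate. With a predicate that checks only buildUnregistered, the same demo would block forever in join(), which is the shape of the hang described in this ticket.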

A deterministic regression test (jstests/sharding/resharding_recipient_buildindex_stepup_deadlock.js) reproduces the deadlock on both branches by starving the index build's commit quorum and driving two stepdown/step-up flaps of the recipient primary.

Original deadlock theory (proven wrong): the progress-write path below is protected by multiple interruption nets (cancellable opCtx, POS opCtx registration, RstlKillOpThread) and doesn't deadlock:

When a shard node experiences rapid stepdown and re-election while ReshardingRecipientService has a live instance, PrimaryOnlyService::onStepUp can deadlock indefinitely, preventing the node from completing its transition to primary and accepting writes.

During step-up, OplogApplier-0 acquires the RSTL exclusively via AutoGetRstlForStepUpStepDown (replication_coordinator_impl.cpp:1430) and holds it for the duration of the handoff sequence. It then calls ReplicaSetAwareServiceRegistry::onStepUpComplete synchronously (replication_coordinator_impl.cpp:1505).

Inside PrimaryOnlyService::onStepUp (primary_only_service.cpp:414), the code calls (*newThenOldScopedExecutor)->join(), waiting for the previous term's executor to drain. If the previous term's ReshardingRecipientService teardown task is still running and needs to write its final state (ReshardingOplogApplier::_clearAppliedOpsAndStoreProgress, resharding_oplog_applier.cpp:266), it requires RSTL-IX. Since the caller already holds RSTL-X, neither side can proceed.

We may be able to fix this using an approach similar to the one used in SERVER-73915.

is related to

SERVER-73915 TransactionCoordinatorService may stall primary step-up from completing when replica set shard steps down and back up quickly

Closed

Assignee: Pierre Turin
Reporter: Malik Endsley (Inactive)
Participants: Malik Endsley, Pierre Turin
Votes: 0
Watchers: 8

Created: May 27 2026 10:54:16 PM UTC
Updated: Jul 31 2026 11:20:36 PM UTC
