-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Replication
-
ALL
When a shard node experiences rapid stepdown and re-election while ReshardingRecipientService has a live instance, PrimaryOnlyService::onStepUp can deadlock indefinitely, preventing the node from completing its transition to primary and accepting writes.
During step-up, OplogApplier-0 acquires the RSTL exclusively via AutoGetRstlForStepUpStepDown (replication_coordinator_impl.cpp:1430) and holds it for the duration of the handoff sequence. It then calls ReplicaSetAwareServiceRegistry::onStepUpComplete synchronously (replication_coordinator_impl.cpp:1505).
Inside PrimaryOnlyService::onStepUp (primary_only_service.cpp:414), the code calls (*newThenOldScopedExecutor)->join(), which blocks until the previous term's executor has drained. If the previous term's ReshardingRecipientService teardown task is still running and needs to write its final state (ReshardingOplogApplier::_clearAppliedOpsAndStoreProgress, resharding_oplog_applier.cpp:266), that task must acquire the RSTL in IX mode. Because the step-up thread already holds the RSTL in X mode and is blocked in join(), the teardown task can never acquire the lock and join() can never return, so neither side can proceed.
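The wait cycle described above can be reproduced in miniature. The following is a minimal sketch, not server code: a std::shared_timed_mutex stands in for the RSTL (exclusive lock for mode X, shared lock for mode IX), std::async stands in for the previous term's executor, and the function name is hypothetical. A timeout is added so the sketch terminates; in the scenario reported here there is no timeout, so the wait is indefinite.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <shared_mutex>

// Hypothetical model of the reported cycle. Returns true if the teardown
// task managed to take the lock, false if it timed out waiting for it.
bool teardownCompletesWhileStepUpHoldsX(std::chrono::milliseconds timeout) {
    // Stand-in for the RSTL (assumption: exclusive == X, shared == IX).
    std::shared_timed_mutex rstl;

    // The step-up thread takes the lock exclusively and keeps it.
    std::unique_lock<std::shared_timed_mutex> stepUpLock(rstl);
    assert(stepUpLock.owns_lock());

    // The previous term's teardown task needs the lock in a mode that
    // conflicts with X before it can write its final state.
    std::future<bool> teardown = std::async(std::launch::async, [&rstl, timeout] {
        std::shared_lock<std::shared_timed_mutex> lk(rstl, timeout);
        return lk.owns_lock();
    });

    // Equivalent of join(): wait for the old task while still holding X.
    return teardown.get();
}
```

Calling teardownCompletesWhileStepUpHoldsX(std::chrono::milliseconds(50)) returns false: the teardown task only gives up because of the artificial timeout, which is the point at which the real node would hang.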
We may be able to fix this with an approach similar to the one used in SERVER-73915.
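This ticket does not spell out what SERVER-73915 changed, so the following is only a generic sketch of one way to break the cycle, under the assumption that the wait for the old term's work can be moved outside the exclusive critical section. As before, a std::shared_timed_mutex stands in for the RSTL, std::async stands in for the old executor, and the function name is hypothetical.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <shared_mutex>

// Hypothetical variant in which the wait happens after X is released.
// Returns true if the teardown task acquired the lock within the timeout.
bool teardownCompletesWhenJoinIsDeferred(std::chrono::milliseconds timeout) {
    // Stand-in for the RSTL (assumption: exclusive == X, shared == IX).
    std::shared_timed_mutex rstl;
    std::future<bool> teardown;

    {
        // Step-up work that genuinely needs the exclusive lock.
        std::unique_lock<std::shared_timed_mutex> stepUpLock(rstl);

        // The old term's teardown task starts and queues behind X.
        teardown = std::async(std::launch::async, [&rstl, timeout] {
            std::shared_lock<std::shared_timed_mutex> lk(rstl, timeout);
            return lk.owns_lock();
        });

        // No waiting on the teardown task inside this scope.
    }  // X is released here.

    // The wait is performed only after the exclusive lock is gone,
    // so the teardown task is free to take the lock and finish.
    return teardown.get();
}
```

With a generous timeout such as std::chrono::seconds(2), the teardown task acquires the lock as soon as the scope ends. Whether the real step-up path can tolerate the old term's tasks finishing after the handoff is a separate question that this sketch does not answer.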
- is related to
- SERVER-73915 TransactionCoordinatorService may stall primary step-up from completing when replica set shard steps down and back up quickly (Closed)