Resharding recipient service deadlock on rapid stepdown then stepup

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • ALL

      When a shard node experiences rapid stepdown and re-election while ReshardingRecipientService has a live instance, PrimaryOnlyService::onStepUp can deadlock indefinitely, preventing the node from completing its transition to primary and accepting writes.

      During step-up, OplogApplier-0 acquires the RSTL exclusively via AutoGetRstlForStepUpStepDown (replication_coordinator_impl.cpp:1430) and holds it for the duration of the handoff sequence. It then calls ReplicaSetAwareServiceRegistry::onStepUpComplete synchronously (replication_coordinator_impl.cpp:1505).

      Inside PrimaryOnlyService::onStepUp (primary_only_service.cpp:414), the code calls (*newThenOldScopedExecutor)->join(), waiting for the previous term's executor to drain. If the previous term's ReshardingRecipientService teardown task is still running and needs to write its final state (ReshardingOplogApplier::_clearAppliedOpsAndStoreProgress, resharding_oplog_applier.cpp:266), it must acquire the RSTL in IX mode. The step-up thread already holds the RSTL in X mode and does not release it until join() returns, while the teardown task cannot finish until it acquires the RSTL, so neither side can proceed.
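
The lock cycle described above can be modeled in a few lines of standard C++. This is only a sketch under stated assumptions, not server code: FakeRstl, teardownTryWriteFinalState, stepUpJoinsWhileHoldingX, and stepUpJoinsAfterReleasingX are hypothetical names; a std::shared_timed_mutex stands in for the RSTL, with exclusive ownership modeling RSTL-X and shared ownership modeling RSTL-IX (a simplification of the real lock modes). The teardown uses a timed acquire so the example terminates instead of hanging.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the RSTL (assumption, not the server's lock type).
// Exclusive ownership models RSTL-X; shared ownership models RSTL-IX.
using FakeRstl = std::shared_timed_mutex;

// Models the previous term's teardown task, which needs "RSTL-IX" to write
// its final state. Returns true if the lock was acquired within `timeout`.
bool teardownTryWriteFinalState(FakeRstl& rstl, std::chrono::milliseconds timeout) {
    if (!rstl.try_lock_shared_for(timeout)) {
        return false;  // still blocked behind the exclusive holder
    }
    rstl.unlock_shared();
    return true;
}

// Models the step-up thread: it holds "RSTL-X" and then joins the old
// executor. With a blocking acquire in the teardown, get() would never return.
bool stepUpJoinsWhileHoldingX(std::chrono::milliseconds timeout) {
    FakeRstl rstl;
    std::unique_lock<FakeRstl> rstlX(rstl);  // exclusive hold during step-up
    auto teardown = std::async(std::launch::async, [&rstl, timeout] {
        return teardownTryWriteFinalState(rstl, timeout);
    });
    return teardown.get();  // the "join" happens while X is still held
}

// Contrast case: the exclusive hold is dropped before joining, so the
// teardown can take the shared lock and finish.
bool stepUpJoinsAfterReleasingX(std::chrono::milliseconds timeout) {
    FakeRstl rstl;
    std::unique_lock<FakeRstl> rstlX(rstl);
    auto teardown = std::async(std::launch::async, [&rstl, timeout] {
        return teardownTryWriteFinalState(rstl, timeout);
    });
    rstlX.unlock();  // release "RSTL-X" before waiting on the old executor
    return teardown.get();
}
```

In this model, stepUpJoinsWhileHoldingX reports that the teardown could not acquire the lock, which corresponds to the hang in the real code path where the acquire has no timeout. The contrast case only illustrates that breaking the hold-while-joining pattern removes the cycle; it is not a claim about how SERVER-73915 was fixed.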
       
      We may be able to fix this with an approach similar to the one used in SERVER-73915.

            Assignee:
            Unassigned
            Reporter:
            Malik Endsley
            Votes:
            0
            Watchers:
            4
