TLA+ spec + jstest for balancer round committing after balancerStop:1 returns OK (SERVER-100155)

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Summary

      TLA+ spec + jstest for SERVER-100155 (CAR-impact 5): a stepdown-interleave lets a migration commit AFTER balancerStop returns OK. Breaks every operator workflow that runs mongodump / mongosync after stopping the balancer — including documented backup procedures.

      Files

      • src/mongo/tla_plus/Sharding/BalancerStopRace/BalancerStopRace.tla (306 lines) — composes balancer-coordinator × config-stepdown × migration-commit-path. AllowStepdownInterleave bug toggle; ghost counter commitsAfterStopOk; invariant StopThenObserveStableState. Migration table modelled as a fixed-arity function [1..MaxMigrations -> {unset, commitInFlight, committed, aborted\}] rather than a sequence, to keep state space tractable.
      • MCBalancerStopRace.tla / .cfg (green) + _Bug.cfg
      • OWNERS.yml → 10gen/server-sharding
      • README.md (295 words)
      • jstests/sharding/balancer_stop_no_round_after_return.js (196 lines) — pins donor at hangBeforeSendingCommitDecision, fires configRS.stepUp mid-commit, runs balancerStop in a parallel shell that stamps the OK timestamp into a sentinel doc, releases the failpoint, then asserts every chunk's history[].validAfter predates that OK timestamp.

      TLC verdicts

      • Green: 373 states / 197 distinct / depth 15 / "No error has been found".
      • Bug: Invariant StopThenObserveStableState is violated with predicted 8-state counterexample BeginRound → DispatchMigration → StepDown → EndRoundOnInterrupt → BalancerStopBegin → BalancerStopReturnOk → ShardCommitsMigration; terminal commitsAfterStopOk = 1.

      Key insight surfaced by the spec

      The bug isn't strictly that joinCurrentRound returns early — it's that _endRound clears _inBalancerRound on the InterruptedDueToReplStateChange catch path without draining migrations whose future-chain has already been forked to the shard. The spec-implied fix: _endRound on the interrupt branch must wait for the command scheduler's in-flight migrations to reach a terminal committed | aborted state.

            Assignee:
            Unassigned
            Reporter:
            Mehar Grewal
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: