Add TLA+ spec for resharding coordinator state machine (catches SERVER-115139 race)

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Summary

      TLA+ specification at src/mongo/tla_plus/Sharding/ReshardingCoordinator/ models the resharding coordinator state machine and the SERVER-115139 race. TLC produces a counterexample showing that an _shardsvrAbortReshardCollection arriving in the (Stepup, InitCancelSource) window is silently dropped under the buggy implementation.

      Files

      • ReshardingCoordinator.tla (511 lines) — actors: 1 Coordinator, 2 Donors, 2 Recipients. State per actor: durable phase, primary, hasCancelSource, pendingAbort.
      • {{MCReshardingCoordinator. {tla,cfg}

        }} — green cfg (BugMode = FALSE). All invariants pass.

      • MCReshardingCoordinatorAbortBeforeInit.cfg — bug cfg (BugMode = TRUE). Encodes the SERVER-115139 race.
      • OWNERS.yml10gen/server-sharding.

      Invariants checked

      • AbortAlwaysHandled — if abortRequested, the abort must be observed somewhere or queued in pendingAbort.
      • NoCommitAfterAbort — coordinator does not commit through an observed abort.
      • EveryStepupInitializesCancelSource — no abort sits in pendingAbort on an actor that already has its cancellation source.
      • StepdownDestroysCancelSource, CoordMonotoneOnSuccess, ParticipantsRespectCoordinator, TypeOK.

      TLC verdicts

      cfg states / distinct depth result
      green (BugMode = FALSE) 295,339 / 62,683 25 all invariants hold
      bug (BugMode = TRUE) 6,803 / 2,073 8 AbortAlwaysHandled violated

      Counterexample (bug cfg, 6 states)

      1. Init: all actors primary, all hasCancelSource = TRUE.
      2. CoordAdvance: coordState = coordInitializing.
      3. Stepdown(c): primary[c] = FALSE, hasCancelSource[c] = FALSE.
      4. AbortRequest(c): abortRequested = TRUE; BugMode drops it — abortObserved stays FALSE, pendingAbort[c] stays FALSE.
      5. Stepup(c): primary[c] = TRUE, hasCancelSource[c] still FALSE (init gap).
      6. InitCancelSource(c): hasCancelSource[c] = TRUE, no queued abort to drain.

      Final state: abortRequested = TRUE, abortObserved = FALSE, all hasCancelSource = TRUE, no pendingAbort. AbortAlwaysHandled is violated — the abort was accepted by the router but is invisible to every cancellation source.

      Reproducing

      cd src/mongo/tla_plus
      ./download-tlc.sh
      ./model-check.sh Sharding/ReshardingCoordinator   # green: pass
      cp Sharding/ReshardingCoordinator/MCReshardingCoordinatorAbortBeforeInit.cfg \
         Sharding/ReshardingCoordinator/MCReshardingCoordinator.cfg
      ./model-check.sh Sharding/ReshardingCoordinator   # bug: AbortAlwaysHandled violated
      

      Related

      • SERVER-115139 — ReshardingDonorService cannot reliably handle the case where _shardsvrAbortReshardCollection command comes in before the cancellation source is initialized upon stepup recovery

            Assignee:
            Unassigned
            Reporter:
            Mehar Grewal
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: