Add TLA+ spec for snapshot-time vs operationTime ordering (companion to SERVER-120304)

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Summary

      Companion TLA+ specification for SERVER-120304 ("Snapshot time may advance before fetching a command's operationTime"). Models the (AcquireSnapshot, ExecuteRead, FetchOperationTime, SendResponse) interleaving with a concurrent AdvanceCommittedSnapshot writer, parameterised by a UseLiveCommittedSnapshot boolean that toggles between the pre-fix code (live committed-snapshot read) and the fix (getLastUsedReadTimestamp()).

      Files

      • src/mongo/tla_plus/Replication/SnapshotOperationTimeOrdering/SnapshotOperationTimeOrdering.tla (253 lines)
      • src/mongo/tla_plus/Replication/SnapshotOperationTimeOrdering/MCSnapshotOperationTimeOrdering.tla (15 lines)
      • src/mongo/tla_plus/Replication/SnapshotOperationTimeOrdering/MCSnapshotOperationTimeOrdering.cfg (32 lines)
      • src/mongo/tla_plus/Replication/SnapshotOperationTimeOrdering/OWNERS.yml (routes to 10gen/server-replication-reviewers)
      • src/mongo/tla_plus/Replication/SnapshotOperationTimeOrdering/README.md

      Invariant

      SnapshotMatchesOperationTime == \A r \in executedReads : r.opTime <= r.snapshot

      TLC verdicts

      Configuration Result States
      UseLiveCommittedSnapshot = FALSE (fix) No error 1398 generated / 530 distinct
      UseLiveCommittedSnapshot = TRUE (pre-fix) SnapshotMatchesOperationTime violated at depth 8 455 generated / 245 distinct

      The bug trace: ClockTick → AcquireSnapshot(t1)@committedSnapshot=0 → AdvanceCommittedSnapshot→1 → ClockTick → ExecuteRead → FetchOperationTime (reads live 1) → SendResponse → {{executedReads =

      {[snapshot |-> 0, opTime |-> 1]}

      }}. Mirrors the jstest in jstests/replsets/operation_time_majority_read_race.js.

      How to run

      cd src/mongo/tla_plus
      ./download-tlc.sh
      ./model-check.sh Replication/SnapshotOperationTimeOrdering
      

      Flip UseLiveCommittedSnapshot to TRUE in the cfg to reproduce the counterexample.

      Related

      • SERVER-120304 — snapshot/operationTime ordering bug (PR #49451 → reverted #53585 → re-landed #53614 in ~30 days)

            Assignee:
            Unassigned
            Reporter:
            Mehar Grewal
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: