Drain leaks aborted prepared marker without a globally visible fallback

    • Storage Engines, Storage Engines - Transactions
    • 0.275
    • SE Transactions - 2026-05-22
    • 5

      Summary

      When the disaggregated drain copies a rolled-back follower prepare to the stable btree, it leaves a single ABORTED+INPROGRESS upd on the in-memory update chain with no fallback below it. When the prepare's rollback timestamp later becomes stable on the leader, reconciliation silently drops the marker from disk via the rollback-stable skip in rec_visibility.c. After the carrying page is evicted and re-read, the restored upd has txnid=0 instead of WT_TXN_ABORTED, which trips the ABORTED-safe guard in __layered_assert_stable_btree_state on the next step-up drain.

      Bug Chain

      1. Follower prepares an INSERT (key has no prior value on the stable btree) and rolls back with rollback_ts > prepare_ts.
      2. Step-up triggers __layered_copy_ingest_table. The is_prepare_rollback branch (conn_layered_ingest.c around lines 641-680) allocates exactly one upd carrying the rolled-back value and stamps it:
        • upd->txnid = WT_TXN_ABORTED
        • upd->prepare_state = WT_PREPARE_INPROGRESS
        • upd->prepare_ts = start_prepare_ts, upd->prepared_id = start_prepared_id
        • upd->upd_saved_txnid = start_ts, upd->upd_rollback_ts = durable_start_ts
      3. Checkpoint A reconciles the page. prepare_ts <= rec_start_pinned_stable_ts, so rec_visibility.c writes a PREPARE cell. WT_TIME_WINDOW_SET_START replaces WT_TXN_ABORTED with upd_saved_txnid on disk — so the on-disk cell carries start_txn=0 (when the original was a follower prepare).
      4. Time advances. Stable timestamp moves past rollback_ts.
      5. Checkpoint B reconciles the same page. The rollback-stable skip in rec_visibility.c now fires (upd_rollback_ts <= rec_start_pinned_stable_ts), and the upd is dropped — no cell is written for this key. On-disk state for this key vanishes.
      6. The page is evicted and later re-read. Page restoration reconstructs the in-memory upd from the prior on-disk cell (Checkpoint A's PREPARE cell, still reachable via the page's block history) with txnid=0, prepare_state=INPROGRESS.
      7. Step-up drain calls __layered_assert_stable_btree_state. The ABORTED-safe guard expects txnid == WT_TXN_ABORTED to skip preserved-rollback markers; the restored txnid=0 does not match, so the assert fires.
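
      The chain above can be replayed with a small toy model. This is a sketch under stated assumptions, not WiredTiger code: the Upd class, the dict-shaped cell, and the reconcile, restore, and aborted_safe_guard helpers are hypothetical stand-ins for the behaviour steps 3 to 7 describe. Timestamps are taken from the log evidence, and saved_txnid=0 follows the follower-prepare case in step 3.

```python
from dataclasses import dataclass
from typing import Optional

WT_TXN_ABORTED = 2**64 - 1  # UINT64_MAX, the txnid shown in the chain dump


@dataclass
class Upd:
    """Toy stand-in for an update; only the fields the bug chain mentions."""
    txnid: int
    prepare_in_progress: bool
    prepare_ts: int
    saved_txnid: int
    rollback_ts: int


def reconcile(upd: Upd, stable_ts: int) -> Optional[dict]:
    """Return the toy on-disk cell for the key, or None when nothing is written."""
    if upd.txnid == WT_TXN_ABORTED and upd.rollback_ts <= stable_ts:
        return None  # rollback-stable skip: the marker is dropped (step 5)
    if upd.prepare_ts <= stable_ts:
        # PREPARE cell: the saved txnid replaces WT_TXN_ABORTED on disk (step 3)
        return {"start_txn": upd.saved_txnid, "prepare_ts": upd.prepare_ts}
    raise ValueError("case not described in the ticket; not modelled")


def restore(cell: dict) -> Upd:
    """Rebuild an in-memory upd from a PREPARE cell (step 6).

    The toy cell carries no rollback timestamp, so rollback_ts is left at 0.
    """
    return Upd(cell["start_txn"], True, cell["prepare_ts"], cell["start_txn"], 0)


def aborted_safe_guard(upd: Upd) -> bool:
    """True when the drain would skip the upd as a preserved-rollback marker (step 7)."""
    return upd.txnid == WT_TXN_ABORTED


marker = Upd(WT_TXN_ABORTED, True, prepare_ts=97249, saved_txnid=0, rollback_ts=97314)
cell_a = reconcile(marker, stable_ts=97313)  # Checkpoint A: PREPARE cell written
cell_b = reconcile(marker, stable_ts=97314)  # Checkpoint B: marker dropped, no cell
restored = restore(cell_a)                   # re-read picks up Checkpoint A's cell

print(cell_a, cell_b, restored.txnid, aborted_safe_guard(restored))
```

      In this model the original marker passes the guard, while the restored upd has txnid=0 and does not, which is the mismatch step 7 reports.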

      Why the Marker Was Designed to Be Transient

      The drain's comment captures the original intent:

      If the prepared update is aborted, move the aborted update to the stable table because we may write a prepared update to the disk in a future reconciliation.

      The marker was meant to be discovered and resolved by prepare_discover on the follower side before rollback_ts reached stability. The case where the follower never sees the carrying checkpoint — because role switches between leader writes mean only the latest shared-metadata checkpoint gets picked up — was not anticipated. See WT-17583 for the full investigation log evidence.

      Log Evidence

      From RUNDIR.8.log, prepared_id=10449 on file:T00003.wt_stable:

      [PREPARE_TRACE] drain: else-branch ROLLBACK btree=file:T00003.wt_ingest
          prepared_id=10449 prepare_ts=97249 rollback_ts=97314 last_ckpt_ts=62121
      
      [PREPARE_TRACE] rec: write INPROGRESS prepare to disk btree=file:T00003.wt_stable
          prepared_id=10449 prepare_ts=97249 rec_start_pinned_stable_ts=97313 is_checkpoint=1
      
      [PREPARE_TRACE] rec: chain[0] txnid=18446744073709551615 type=3 prepare_state=1
          prepared_id=10449 prepare_ts=97249 start_ts=31920 durable_ts=97314
          rollback_ts=97314 saved_txnid=31920 flags=0x800 <-- selected
      

      The chain dump shows that the marker is the only upd on the chain: it has type=3 (WT_UPDATE_STANDARD) and txnid=UINT64_MAX (WT_TXN_ABORTED), with no follow-up tombstone and no HS-restored value.
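
      The values in the trace can be checked mechanically. The sketch below assumes the trace lines are joined onto a single line each; parse_trace_fields is a hypothetical helper written for this illustration, not part of any WiredTiger tooling. It confirms that the logged txnid equals UINT64_MAX and that Checkpoint A's pinned stable timestamp falls between prepare_ts and rollback_ts, which is the window in which a PREPARE cell is written.

```python
import re

UINT64_MAX = 2**64 - 1

# Trace lines copied from the log evidence, with continuation lines joined.
CHAIN_LINE = (
    "[PREPARE_TRACE] rec: chain[0] txnid=18446744073709551615 type=3 prepare_state=1 "
    "prepared_id=10449 prepare_ts=97249 start_ts=31920 durable_ts=97314 "
    "rollback_ts=97314 saved_txnid=31920 flags=0x800"
)
REC_LINE = (
    "[PREPARE_TRACE] rec: write INPROGRESS prepare to disk btree=file:T00003.wt_stable "
    "prepared_id=10449 prepare_ts=97249 rec_start_pinned_stable_ts=97313 is_checkpoint=1"
)


def parse_trace_fields(line: str) -> dict:
    """Collect the numeric key=value pairs of a trace line (decimal or 0x hex)."""
    pairs = re.findall(r"(\w+)=(0x[0-9a-fA-F]+|\d+)", line)
    return {key: int(value, 0) for key, value in pairs}


chain = parse_trace_fields(CHAIN_LINE)
rec = parse_trace_fields(REC_LINE)

is_aborted = chain["txnid"] == UINT64_MAX
in_prepare_window = (
    rec["prepare_ts"] <= rec["rec_start_pinned_stable_ts"] < chain["rollback_ts"]
)
print(is_aborted, in_prepare_window)
```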

      Proposed Fix

      Inside __layered_move_updates in conn_layered_ingest.c, after building the chain but before __wt_row_modify, detect the bug-prone configuration:

      • The chain being moved contains only the ABORTED+INPROGRESS marker (no other upds queued for this key)
      • The stable btree has no existing value for this key (the row search lands on a position where cbt->compare != 0, or there is no on-page value)

      In that case, allocate a globally visible tombstone and append it as marker->next:

      • upd->txnid = WT_TXN_NONE
      • upd->upd_start_ts = WT_TS_NONE
      • upd->upd_durable_ts = WT_TS_NONE
      • upd->type = WT_UPDATE_TOMBSTONE

      When Checkpoint B later drops the marker via the rollback-stable skip, the tombstone below it is what reconciliation writes — the correct post-rollback state for a key that did not exist before the prepare. Any future re-read restores a tombstone upd, not an INPROGRESS upd, and the assert no longer trips.

      The UPDATE case (key did have a prior value on the stable btree) does not need this treatment: the prior value is already on disk and remains correct once the marker is dropped.
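
      The intended effect of the fix can be sketched with a toy update chain. Everything here is an illustrative assumption rather than the actual patch: the Upd class, add_fallback_if_needed, and select_for_reconcile are hypothetical names, the chain is a Python list ordered newest first (so index 1 plays the role of marker->next), and selection is reduced to the rollback-stable skip alone.

```python
from dataclasses import dataclass
from typing import List, Optional

WT_TXN_NONE = 0
WT_TS_NONE = 0
WT_TXN_ABORTED = 2**64 - 1


@dataclass
class Upd:
    """Toy update: just enough state to model the marker and its fallback."""
    type: str  # "STANDARD" or "TOMBSTONE"
    txnid: int
    start_ts: int = WT_TS_NONE
    durable_ts: int = WT_TS_NONE
    prepare_in_progress: bool = False
    rollback_ts: int = WT_TS_NONE


def is_rollback_marker(upd: Upd) -> bool:
    """True for the ABORTED+INPROGRESS marker the drain creates."""
    return upd.txnid == WT_TXN_ABORTED and upd.prepare_in_progress


def add_fallback_if_needed(chain: List[Upd], stable_has_value: bool) -> List[Upd]:
    """Append a globally visible tombstone below a lone marker for a new key."""
    if len(chain) == 1 and is_rollback_marker(chain[0]) and not stable_has_value:
        chain.append(Upd(type="TOMBSTONE", txnid=WT_TXN_NONE))
    return chain


def select_for_reconcile(chain: List[Upd], stable_ts: int) -> Optional[Upd]:
    """Skip rollback-stable markers; return the first remaining upd, if any."""
    for upd in chain:
        if is_rollback_marker(upd) and upd.rollback_ts <= stable_ts:
            continue
        return upd
    return None


def make_marker() -> Upd:
    return Upd("STANDARD", WT_TXN_ABORTED, prepare_in_progress=True, rollback_ts=97314)


unfixed = [make_marker()]
fixed = add_fallback_if_needed([make_marker()], stable_has_value=False)
update_case = add_fallback_if_needed([make_marker()], stable_has_value=True)

before = select_for_reconcile(unfixed, stable_ts=97314)  # nothing left to write
after = select_for_reconcile(fixed, stable_ts=97314)     # tombstone is written
print(before, after.type, len(update_case))
```

      With the fallback in place, the model's Checkpoint B selects the tombstone instead of finding an empty chain, and the UPDATE case is left untouched.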

      Verification

      A new test test/suite/test_prepare_discover12.py reproduces the scenario deterministically:

      1. Open as follower; prepare INSERT for a new key; roll back with rollback_ts > prepare_ts
      2. Step up — drain creates the ABORTED+INPROGRESS marker on the stable btree
      3. Checkpoint A — marker written to disk
      4. Force eviction of the key's page
      5. Advance stable past rollback_ts; Checkpoint B — marker dropped from disk
      6. Verify the key is WT_NOTFOUND
      7. Step down + write to ingest + step up — triggers a second drain
      8. Verify the second drain does not assert and the key remains WT_NOTFOUND

      Related Tickets

      • WT-17583 — initial investigation of the format failure (originally misdiagnosed as a producer-side ta.prepare bug; this ticket supersedes that diagnosis)
      • WT-17584 — refactor shared prepare-flag parsing between RTS and prepare_discover (unrelated cleanup discovered during this investigation)

            Assignee:
            Chenhao Qu
            Reporter:
            Chenhao Qu