- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Btree, Transactions
- Storage Engines, Storage Engines - Transactions
- 0.275
- SE Transactions - 2026-05-22
- 5
Summary
When the disaggregated drain copies a rolled-back follower prepare to the stable btree, it leaves a single ABORTED+INPROGRESS upd on the in-memory update chain with no fallback below it. When the prepare's rollback timestamp later becomes stable on the leader, reconciliation silently drops the marker from disk via the rollback-stable skip in rec_visibility.c. After the carrying page is evicted and re-read, the restored upd has txnid=0 instead of WT_TXN_ABORTED, which trips the ABORTED-safe guard in __layered_assert_stable_btree_state on the next step-up drain.
Bug Chain
- Follower prepares an INSERT (key has no prior value on the stable btree) and rolls back with rollback_ts > prepare_ts.
- Step-up triggers __layered_copy_ingest_table. The is_prepare_rollback branch (conn_layered_ingest.c around lines 641-680) allocates exactly one upd carrying the rolled-back value and stamps it:
  - upd->txnid = WT_TXN_ABORTED
  - upd->prepare_state = WT_PREPARE_INPROGRESS
  - upd->prepare_ts = start_prepare_ts, upd->prepared_id = start_prepared_id
  - upd->upd_saved_txnid = start_ts, upd->upd_rollback_ts = durable_start_ts
- Checkpoint A reconciles the page. prepare_ts <= rec_start_pinned_stable_ts, so rec_visibility.c writes a PREPARE cell. WT_TIME_WINDOW_SET_START replaces WT_TXN_ABORTED with upd_saved_txnid on disk — so the on-disk cell carries start_txn=0 (when the original was a follower prepare).
- Time advances. Stable timestamp moves past rollback_ts.
- Checkpoint B reconciles the same page. The rollback-stable skip in rec_visibility.c now fires (upd_rollback_ts <= rec_start_pinned_stable_ts), and the upd is dropped — no cell is written for this key. On-disk state for this key vanishes.
- The page is evicted and later re-read. Page restoration reconstructs the in-memory upd from the prior on-disk cell (Checkpoint A's PREPARE cell, still reachable via the page's block history) with txnid=0, prepare_state=INPROGRESS.
- Step-up drain calls __layered_assert_stable_btree_state. The ABORTED-safe guard expects txnid == WT_TXN_ABORTED to skip preserved-rollback markers; the restored txnid=0 does not match, so the assert fires.
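The chain above can be replayed with a small toy model. This is a sketch, not WiredTiger code: ToyMarker, toy_reconcile, toy_restore and toy_guard_skips are hypothetical names, and the logic encodes only the three decisions described in the steps (PREPARE cell at Checkpoint A, rollback-stable skip at Checkpoint B, guard comparison after the re-read). The timestamps come from the log evidence in this ticket; saved_txnid=0 follows the follower-prepare case described above.

```python
from dataclasses import dataclass

WT_TXN_ABORTED = 2**64 - 1  # UINT64_MAX, matching txnid in the log evidence


@dataclass
class ToyMarker:
    """Hypothetical stand-in for the single upd left by the drain."""
    txnid: int
    prepare_ts: int
    rollback_ts: int
    saved_txnid: int


def toy_reconcile(marker, pinned_stable_ts):
    """Return the toy on-disk cell, or None when no cell is written."""
    if marker.rollback_ts <= pinned_stable_ts:
        return None  # rollback-stable skip: the upd is dropped
    if marker.prepare_ts <= pinned_stable_ts:
        # The aborted txnid is replaced by the saved txnid on disk.
        return {"prepare": True, "start_txn": marker.saved_txnid}
    raise NotImplementedError("case not described in this ticket")


def toy_restore(cell):
    """Rebuild an in-memory upd from a PREPARE cell after a re-read."""
    return {"txnid": cell["start_txn"], "prepare_state": "INPROGRESS"}


def toy_guard_skips(upd):
    """The ABORTED-safe guard only skips upds whose txnid is WT_TXN_ABORTED."""
    return upd["txnid"] == WT_TXN_ABORTED


marker = ToyMarker(txnid=WT_TXN_ABORTED, prepare_ts=97249,
                   rollback_ts=97314, saved_txnid=0)

cell_a = toy_reconcile(marker, pinned_stable_ts=97313)  # Checkpoint A
cell_b = toy_reconcile(marker, pinned_stable_ts=97314)  # Checkpoint B
restored = toy_restore(cell_a)                          # evict + re-read
guard_skips = toy_guard_skips(restored)
```

In this model Checkpoint A yields a PREPARE cell, Checkpoint B yields nothing, and the restored upd is not recognized by the guard, which corresponds to the assert firing.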
Why the Marker Was Designed to Be Transient
The drain's comment captures the original intent:
"If the prepared update is aborted, move the aborted update to the stable table because we may write a prepared update to the disk in a future reconciliation."
The marker was meant to be discovered and resolved by prepare_discover on the follower side before rollback_ts became stable. The design did not anticipate a follower that never sees the carrying checkpoint, which happens when role switches between leader writes cause only the latest shared-metadata checkpoint to be picked up. See WT-17583 for the full investigation and its log evidence.
Log Evidence
From RUNDIR.8.log, prepared_id=10449 on file:T00003.wt_stable:
[PREPARE_TRACE] drain: else-branch ROLLBACK btree=file:T00003.wt_ingest
prepared_id=10449 prepare_ts=97249 rollback_ts=97314 last_ckpt_ts=62121
[PREPARE_TRACE] rec: write INPROGRESS prepare to disk btree=file:T00003.wt_stable
prepared_id=10449 prepare_ts=97249 rec_start_pinned_stable_ts=97313 is_checkpoint=1
[PREPARE_TRACE] rec: chain[0] txnid=18446744073709551615 type=3 prepare_state=1
prepared_id=10449 prepare_ts=97249 start_ts=31920 durable_ts=97314
rollback_ts=97314 saved_txnid=31920 flags=0x800 <-- selected
The chain dump shows the marker is the only upd on the chain. type=3 (WT_UPDATE_STANDARD), txnid=UINT64_MAX (WT_TXN_ABORTED), no follow-up tombstone, no HS-restored value.
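As a quick cross-check of the trace, the fields can be parsed and compared mechanically. The helper below is a hypothetical sketch (parse_trace_fields is not part of any WiredTiger tooling); it extracts only the numeric key=value pairs from the two rec lines quoted above, with each wrapped continuation line joined back onto its record.

```python
import re

UINT64_MAX = 2**64 - 1


def parse_trace_fields(text):
    """Collect numeric key=value pairs (decimal or 0x-prefixed hex)."""
    pairs = re.findall(r"(\w+)=(0x[0-9a-fA-F]+|\d+)\b", text)
    return {key: int(value, 0) for key, value in pairs}


write_line = (
    "[PREPARE_TRACE] rec: write INPROGRESS prepare to disk "
    "btree=file:T00003.wt_stable prepared_id=10449 prepare_ts=97249 "
    "rec_start_pinned_stable_ts=97313 is_checkpoint=1"
)
chain_line = (
    "[PREPARE_TRACE] rec: chain[0] txnid=18446744073709551615 type=3 "
    "prepare_state=1 prepared_id=10449 prepare_ts=97249 start_ts=31920 "
    "durable_ts=97314 rollback_ts=97314 saved_txnid=31920 flags=0x800"
)

write = parse_trace_fields(write_line)
chain = parse_trace_fields(chain_line)

# The marker's txnid is UINT64_MAX, i.e. WT_TXN_ABORTED per this ticket.
txnid_is_aborted = chain["txnid"] == UINT64_MAX

# The pinned stable timestamp sits between prepare_ts and rollback_ts, so
# this reconciliation writes the prepare instead of taking the skip.
in_write_window = (
    write["prepare_ts"]
    <= write["rec_start_pinned_stable_ts"]
    < chain["rollback_ts"]
)
```

Both checks hold for prepared_id=10449: the pinned stable timestamp (97313) is one tick below rollback_ts (97314).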
Proposed Fix
Inside __layered_move_updates in conn_layered_ingest.c, after building the chain but before __wt_row_modify, detect the bug-prone configuration:
- The chain being moved contains only the ABORTED+INPROGRESS marker (no other upds queued for this key)
- The stable btree has no existing value for this key (the row search lands on a position where cbt->compare != 0, or there is no on-page value)
In that case, allocate a globally visible tombstone and append it as marker->next:
- upd->txnid = WT_TXN_NONE
- upd->upd_start_ts = WT_TS_NONE
- upd->upd_durable_ts = WT_TS_NONE
- upd->type = WT_UPDATE_TOMBSTONE
When Checkpoint B later drops the marker via the rollback-stable skip, the tombstone below it is what reconciliation writes — the correct post-rollback state for a key that did not exist before the prepare. Any future re-read restores a tombstone upd, not an INPROGRESS upd, and the assert no longer trips.
The UPDATE case (key did have a prior value on the stable btree) does not need this treatment: the prior value is already on disk and remains correct once the marker is dropped.
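The detection rule and its effect on Checkpoint B can be sketched with a toy update chain. This is an illustration under stated assumptions, not the patch itself: ToyUpdate and the three helper functions are hypothetical, and the model assumes WT_TXN_NONE and WT_TS_NONE are both 0.

```python
from dataclasses import dataclass
from typing import Optional

WT_TXN_ABORTED = 2**64 - 1
WT_TXN_NONE = 0  # assumed value
WT_TS_NONE = 0   # assumed value


@dataclass
class ToyUpdate:
    """Hypothetical stand-in for an upd on the update chain."""
    type: str  # "STANDARD" or "TOMBSTONE"
    txnid: int
    start_ts: int = WT_TS_NONE
    durable_ts: int = WT_TS_NONE
    rollback_ts: int = WT_TS_NONE
    in_progress: bool = False
    next: Optional["ToyUpdate"] = None


def is_marker(upd):
    return upd.txnid == WT_TXN_ABORTED and upd.in_progress


def needs_fallback_tombstone(head, stable_has_value):
    """Both conditions from the list above: lone marker, no prior value."""
    return is_marker(head) and head.next is None and not stable_has_value


def append_fallback_tombstone(head):
    """Append a globally visible tombstone as marker->next."""
    head.next = ToyUpdate(type="TOMBSTONE", txnid=WT_TXN_NONE)
    return head


def select_for_checkpoint(head, pinned_stable_ts):
    """Walk the chain, skipping markers whose rollback is already stable."""
    upd = head
    while upd is not None:
        if is_marker(upd) and upd.rollback_ts <= pinned_stable_ts:
            upd = upd.next
            continue
        return upd
    return None  # nothing left to write for this key


# INSERT case: lone marker and no value on the stable btree.
marker = ToyUpdate("STANDARD", WT_TXN_ABORTED, rollback_ts=97314,
                   in_progress=True)
unfixed_choice = select_for_checkpoint(marker, 97314)  # no fallback yet
needed = needs_fallback_tombstone(marker, stable_has_value=False)
if needed:
    append_fallback_tombstone(marker)
fixed_choice = select_for_checkpoint(marker, 97314)

# UPDATE case: a prior value exists, so no tombstone is appended.
update_marker = ToyUpdate("STANDARD", WT_TXN_ABORTED, rollback_ts=97314,
                          in_progress=True)
update_needed = needs_fallback_tombstone(update_marker, stable_has_value=True)
```

Without the fallback, the walk at Checkpoint B runs off the end of the chain; with it, the tombstone is the update selected once the marker is skipped.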
Verification
A new test test/suite/test_prepare_discover12.py reproduces the scenario deterministically:
- Open as follower; prepare INSERT for a new key; rollback with rollback_ts > prepare_ts
- Step up — drain creates the ABORTED+INPROGRESS marker on the stable btree
- Checkpoint A — marker written to disk
- Force eviction of the key's page
- Advance stable past rollback_ts; Checkpoint B — marker dropped from disk
- Verify the key is WT_NOTFOUND
- Step down + write to ingest + step up — triggers a second drain
- Verify the second drain does not assert and the key remains WT_NOTFOUND
Related Tickets
- is related to
  - WT-17583 Disaggregated: precise checkpoint writes wrong time aggregate (prepare=0) to shared metadata for stable btrees with skipped INPROGRESS prepared updates (Closed)
  - WT-17584 Refactor checkpoint prepare-flag parsing to share code between prepare_discover and RTS (Open)
  - WT-17590 Replace WT_UPDATE_PREPARE_ROLLBACK flag with a tombstone appended at the end of the update chain (Closed)
- related to
  - WT-17590 Replace WT_UPDATE_PREPARE_ROLLBACK flag with a tombstone appended at the end of the update chain (Closed)