Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0, 9.0.0-rc0
Affects Version/s: None
Component/s: Btree, Transactions
Labels: dc
Assigned Teams: Storage Engines, Storage Engines - Transactions
Total Hours with Assigned Team: 0.275
Sprint: SE Transactions - 2026-05-22
Story Points: 5

Summary

When the disaggregated drain copies a rolled-back follower prepare to the stable btree, it leaves a single ABORTED+INPROGRESS upd on the in-memory update chain with no fallback below it. When the prepare's rollback timestamp later becomes stable on the leader, reconciliation silently drops the marker from disk via the rollback-stable skip in rec_visibility.c. After the carrying page is evicted and re-read, the restored upd has txnid=0 instead of WT_TXN_ABORTED, which trips the ABORTED-safe guard in __layered_assert_stable_btree_state on the next step-up drain.

Bug Chain

1. Follower prepares an INSERT (key has no prior value on the stable btree) and rolls back with rollback_ts > prepare_ts.
2. Step-up triggers __layered_copy_ingest_table. The is_prepare_rollback branch (conn_layered_ingest.c around lines 641-680) allocates exactly one upd carrying the rolled-back value and stamps it:
   - upd->txnid = WT_TXN_ABORTED
   - upd->prepare_state = WT_PREPARE_INPROGRESS
   - upd->prepare_ts = start_prepare_ts, upd->prepared_id = start_prepared_id
   - upd->upd_saved_txnid = start_ts, upd->upd_rollback_ts = durable_start_ts
3. Checkpoint A reconciles the page. prepare_ts <= rec_start_pinned_stable_ts, so rec_visibility.c writes a PREPARE cell. WT_TIME_WINDOW_SET_START replaces WT_TXN_ABORTED with upd_saved_txnid on disk, so the on-disk cell carries start_txn=0 (when the original was a follower prepare).
4. Time advances. Stable timestamp moves past rollback_ts.
5. Checkpoint B reconciles the same page. The rollback-stable skip in rec_visibility.c now fires (upd_rollback_ts <= rec_start_pinned_stable_ts), and the upd is dropped: no cell is written for this key. On-disk state for this key vanishes.
6. The page is evicted and later re-read. Page restoration reconstructs the in-memory upd from the prior on-disk cell (Checkpoint A's PREPARE cell, still reachable via the page's block history) with txnid=0, prepare_state=INPROGRESS.
7. Step-up drain calls __layered_assert_stable_btree_state. The ABORTED-safe guard expects txnid == WT_TXN_ABORTED to skip preserved-rollback markers; the restored txnid=0 does not match, so the assert fires.
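
The chain above can be walked through with a small toy model. This is only an illustrative sketch, not WiredTiger code: the Upd and Cell classes, the reconcile, restore and guard_skips helpers, and the choice of saved_txnid=0 are all assumptions made for the example; only the comparisons against the pinned stable timestamp and the txnid == WT_TXN_ABORTED guard are taken from the description in this ticket.

```python
from dataclasses import dataclass
from typing import Optional

WT_TXN_ABORTED = 2**64 - 1  # UINT64_MAX, matching the txnid in the log evidence


@dataclass
class Upd:
    """Hypothetical stand-in for an in-memory update."""
    txnid: int
    prepare_state: str
    prepare_ts: int
    saved_txnid: int
    rollback_ts: int


@dataclass
class Cell:
    """Hypothetical stand-in for an on-disk PREPARE cell."""
    start_txn: int
    prepare_ts: int


def reconcile(upd: Upd, pinned_stable_ts: int) -> Optional[Cell]:
    """Return the cell a checkpoint would write for a lone marker, or None."""
    if upd.rollback_ts <= pinned_stable_ts:
        return None  # rollback-stable skip: the marker is dropped
    if upd.prepare_ts <= pinned_stable_ts:
        # The saved txnid replaces WT_TXN_ABORTED in the written time window.
        return Cell(start_txn=upd.saved_txnid, prepare_ts=upd.prepare_ts)
    return None  # simplification: prepare not yet stable, nothing written


def restore(cell: Cell) -> Upd:
    """Rebuild an update from a cell; the ABORTED stamp is not recoverable."""
    return Upd(cell.start_txn, "INPROGRESS", cell.prepare_ts, cell.start_txn, 0)


def guard_skips(upd: Upd) -> bool:
    """ABORTED-safe guard: skips only updates stamped WT_TXN_ABORTED."""
    return upd.txnid == WT_TXN_ABORTED


# Marker stamped by the drain (timestamps borrowed from the log evidence).
marker = Upd(WT_TXN_ABORTED, "INPROGRESS",
             prepare_ts=97249, saved_txnid=0, rollback_ts=97314)
cell_a = reconcile(marker, 97313)  # Checkpoint A: PREPARE cell is written
cell_b = reconcile(marker, 97314)  # Checkpoint B: marker is dropped
restored = restore(cell_a)         # re-read after eviction
```

In this model guard_skips(marker) holds for the original marker but not for the restored update, which is the mismatch that trips the assert in step 7.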

Why the Marker Was Designed to Be Transient

The drain's comment captures the original intent:

If the prepared update is aborted, move the aborted update to the stable table because we may write a prepared update to the disk in a future reconciliation.

The marker was meant to be discovered and resolved by prepare_discover on the follower side before rollback_ts reached stability. The case where the follower never sees the carrying checkpoint was not anticipated; it arises because role switches between leader writes mean only the latest shared-metadata checkpoint gets picked up. See WT-17583 for the full investigation and log evidence.

Log Evidence

From RUNDIR.8.log, prepared_id=10449 on file:T00003.wt_stable:

[PREPARE_TRACE] drain: else-branch ROLLBACK btree=file:T00003.wt_ingest
    prepared_id=10449 prepare_ts=97249 rollback_ts=97314 last_ckpt_ts=62121

[PREPARE_TRACE] rec: write INPROGRESS prepare to disk btree=file:T00003.wt_stable
    prepared_id=10449 prepare_ts=97249 rec_start_pinned_stable_ts=97313 is_checkpoint=1

[PREPARE_TRACE] rec: chain[0] txnid=18446744073709551615 type=3 prepare_state=1
    prepared_id=10449 prepare_ts=97249 start_ts=31920 durable_ts=97314
    rollback_ts=97314 saved_txnid=31920 flags=0x800 <-- selected

The chain dump shows that the marker is the only upd on the chain. It has type=3 (WT_UPDATE_STANDARD) and txnid=UINT64_MAX (WT_TXN_ABORTED), and there is no follow-up tombstone or HS-restored value below it.
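
As a quick sanity check, the logged values can be tested against the two conditions described in the bug chain. The variable names below are chosen for this example only; the numbers are copied from the trace above.

```python
UINT64_MAX = 2**64 - 1

# Values copied from the PREPARE_TRACE lines above.
txnid = 18446744073709551615
prepare_ts = 97249
rollback_ts = 97314
pinned_stable_ts = 97313  # rec_start_pinned_stable_ts at this checkpoint

is_aborted = txnid == UINT64_MAX                      # marker stamped ABORTED
prepare_cell_written = prepare_ts <= pinned_stable_ts  # PREPARE cell goes to disk
rollback_skip_fires = rollback_ts <= pinned_stable_ts  # marker dropped

print(is_aborted, prepare_cell_written, rollback_skip_fires)  # True True False
```

At this checkpoint the stable timestamp sits one tick below rollback_ts, so the PREPARE cell is written and the rollback-stable skip has not yet fired. This corresponds to Checkpoint A in the bug chain.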

Proposed Fix

Inside __layered_move_updates in conn_layered_ingest.c, after building the chain but before __wt_row_modify, detect the bug-prone configuration:

- The chain being moved contains only the ABORTED+INPROGRESS marker (no other upds queued for this key)
- The stable btree has no existing value for this key (the row search lands on a position where cbt->compare != 0, or there is no on-page value)

In that case, allocate a globally visible tombstone and append it as marker->next:

upd->txnid = WT_TXN_NONE
upd->upd_start_ts = WT_TS_NONE
upd->upd_durable_ts = WT_TS_NONE
upd->type = WT_UPDATE_TOMBSTONE

When Checkpoint B later drops the marker via the rollback-stable skip, the tombstone below it is what reconciliation writes — the correct post-rollback state for a key that did not exist before the prepare. Any future re-read restores a tombstone upd, not an INPROGRESS upd, and the assert no longer trips.

The UPDATE case (key did have a prior value on the stable btree) does not need this treatment: the prior value is already on disk and remains correct once the marker is dropped.
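
The intended behaviour can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the actual patch: the Upd class and the is_rollback_marker, add_fallback_tombstone and select_after_rollback_stable helpers are hypothetical names, and the real change would be made in C inside the drain.

```python
from dataclasses import dataclass
from typing import List, Optional

WT_TXN_NONE = 0
WT_TS_NONE = 0
WT_TXN_ABORTED = 2**64 - 1  # UINT64_MAX


@dataclass
class Upd:
    """Hypothetical stand-in for an update on the chain."""
    type: str  # "STANDARD" or "TOMBSTONE"
    txnid: int
    prepare_state: str = "NONE"
    start_ts: int = WT_TS_NONE
    durable_ts: int = WT_TS_NONE
    rollback_ts: int = WT_TS_NONE


def is_rollback_marker(upd: Upd) -> bool:
    return upd.txnid == WT_TXN_ABORTED and upd.prepare_state == "INPROGRESS"


def add_fallback_tombstone(chain: List[Upd], key_on_stable: bool) -> List[Upd]:
    """Append a globally visible tombstone below a lone marker for a new key."""
    if len(chain) == 1 and is_rollback_marker(chain[0]) and not key_on_stable:
        return chain + [Upd("TOMBSTONE", WT_TXN_NONE)]
    return chain  # UPDATE case or richer chain: left untouched


def select_after_rollback_stable(chain: List[Upd],
                                 pinned_stable_ts: int) -> Optional[Upd]:
    """Return the first update that survives the rollback-stable skip."""
    for upd in chain:
        if is_rollback_marker(upd) and upd.rollback_ts <= pinned_stable_ts:
            continue  # marker dropped once its rollback is stable
        return upd
    return None


marker = Upd("STANDARD", WT_TXN_ABORTED, "INPROGRESS", rollback_ts=97314)
insert_chain = add_fallback_tombstone([marker], key_on_stable=False)
update_chain = add_fallback_tombstone([marker], key_on_stable=True)
selected = select_after_rollback_stable(insert_chain, 97314)
unfixed = select_after_rollback_stable([marker], 97314)
```

With the tombstone in place, selected is the tombstone, so there is something for reconciliation to write; without it, unfixed is None, which corresponds to the state where the key's on-disk entry simply vanishes.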

Verification

A new test test/suite/test_prepare_discover12.py reproduces the scenario deterministically:

1. Open as follower; prepare INSERT for a new key; rollback with rollback_ts > prepare_ts
2. Step up: drain creates the ABORTED+INPROGRESS marker on the stable btree
3. Checkpoint A: marker written to disk
4. Force eviction of the key's page
5. Advance stable past rollback_ts; Checkpoint B: marker dropped from disk
6. Verify the key is WT_NOTFOUND
7. Step down + write to ingest + step up: triggers a second drain
8. Verify the second drain does not assert and the key remains WT_NOTFOUND

Related Tickets

- WT-17583: initial investigation of the format failure (originally misdiagnosed as a producer-side ta.prepare bug; this ticket supersedes that diagnosis)
- WT-17584: refactor shared prepare-flag parsing between RTS and prepare_discover (unrelated cleanup discovered during this investigation)

is related to

WT-17583 Disaggregated: precise checkpoint writes wrong time aggregate (prepare=0) to shared metadata for stable btrees with skipped INPROGRESS prepared updates

Closed

WT-17584 Refactor checkpoint prepare-flag parsing to share code between prepare_discover and RTS

Closed

WT-17590 Replace WT_UPDATE_PREPARE_ROLLBACK flag with a tombstone appended at the end of the update chain

Closed

related to

WT-17590 Replace WT_UPDATE_PREPARE_ROLLBACK flag with a tombstone appended at the end of the update chain

Closed
