-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Reconciliation
-
Storage Engines - Transactions
-
32.913
-
SE Transactions - 2026-05-08
-
3
Problem
In __rec_upd_select (src/reconcile/rec_visibility.c), when preserve_prepared=true and an aborted prepared insert is evicted before its prepare timestamp becomes stable, the rollback tombstone is incorrectly written to disk and marked with WT_UPDATE_DS. The stale flag causes a subsequent reconciliation (once the prepare timestamp is stable) to skip the preserved prepared cell entirely and write the tombstone again instead.
Update chain triggering the bug
Key 2 — fresh insert, no prior committed value:
[tombstone (WT_UPDATE_PREPARE_ROLLBACK)]
-> [aborted_prepare (prepare_ts=30, rollback_ts=50)]
Two-step failure
Step 1 — first eviction at stable=25 (prepare_ts=30 NOT yet stable):
The eviction session (session_evict) begins its transaction (txnid N) before session_prep is assigned txnid M, so N < M. With precise_checkpoint=true, eviction uses WT_REC_VISIBLE_NO_SNAPSHOT and rec_start_pinned_id = last_running = N. The skip block condition rec_start_pinned_id (N) <= upd_saved_txnid (M) is satisfied, so the aborted prepare is skipped via continue.
Before the fix: prepare_rollback_tombstone is not cleared inside the skip block. The post-loop fallback at line 1042 therefore selects the tombstone as upd_select->upd, and F_SET(upd_select->upd, WT_UPDATE_DS) sets WT_UPDATE_DS on it (which subsumes WT_UPDATE_SELECT_FOR_DS).
Step 2 — second reconciliation (checkpoint) at stable=35 (prepare_ts=30 now stable, rollback_ts=50 not stable):
The tombstone now has WT_UPDATE_DS set. This causes two failures:
1. The skip block condition !F_ISSET(upd, WT_UPDATE_SELECT_FOR_DS) is FALSE, so the skip block is bypassed and the tombstone is not skipped.
2. The prepare_rollback_tombstone assignment at lines 982–984 also checks !F_ISSET(upd, WT_UPDATE_SELECT_FOR_DS), so prepare_rollback_tombstone is never set.
The tombstone is selected directly as upd_select->upd. The preserved prepared cell is never written: rec_time_window_prepared is not incremented, and the wrong data (tombstone) is written to disk instead of the prepared cell.
Root cause
The skip block at lines 823–850 in __rec_upd_select does not clear prepare_rollback_tombstone after skipping a non-visible prepared update. The tombstone recorded earlier in the same loop therefore remains the fallback candidate and becomes the selected update once the loop ends.
Fix
Add prepare_rollback_tombstone = NULL; inside the skip block (line 847). This ensures the tombstone is never selected by the post-loop fallback when the prepare is not yet visible, preventing it from acquiring WT_UPDATE_DS and poisoning subsequent reconciliations.
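The interaction between the skip block, the post-loop fallback, and the flag can be illustrated with a small toy model. This is a sketch only, not the real __rec_upd_select: the Update class, the select_update function, the single SELECT_FOR_DS bit, the concrete transaction IDs (N=10, M=20), and the pinned ID used in step 2 are all assumptions made for illustration. The clear_on_skip argument stands in for the one-line fix.

```python
from dataclasses import dataclass

# Hypothetical flag bit standing in for WT_UPDATE_DS / WT_UPDATE_SELECT_FOR_DS.
SELECT_FOR_DS = 0x1


@dataclass
class Update:
    kind: str          # "tombstone" (rollback tombstone) or "prepare" (aborted prepare)
    txnid: int
    prepare_ts: int
    rollback_ts: int
    flags: int = 0


def select_update(chain, stable_ts, pinned_id, clear_on_skip):
    """Toy update selection; returns the selected Update or None."""
    rollback_tombstone = None
    selected = None
    for upd in chain:
        if upd.flags & SELECT_FOR_DS:
            # Flagged by an earlier reconciliation: taken directly.
            selected = upd
            break
        if upd.kind == "tombstone":
            if upd.rollback_ts <= stable_ts:
                selected = upd  # rollback is stable, write the tombstone
                break
            rollback_tombstone = upd  # remember it and keep walking
            continue
        # Aborted prepared update.
        if pinned_id <= upd.txnid:
            # Skip block: the prepare is not visible to this reconciliation.
            if clear_on_skip:
                rollback_tombstone = None  # models the fix
            continue
        if upd.prepare_ts <= stable_ts:
            selected = upd  # written as the preserved prepared cell
            break
    if selected is None and rollback_tombstone is not None:
        selected = rollback_tombstone  # post-loop fallback
    if selected is not None:
        selected.flags |= SELECT_FOR_DS
    return selected


def run_two_steps(clear_on_skip):
    m = 20  # txnid M of session_prep (assumed value)
    chain = [Update("tombstone", m, 30, 50), Update("prepare", m, 30, 50)]
    # Step 1: eviction at stable=25 with pinned id N=10 < M.
    first = select_update(chain, 25, 10, clear_on_skip)
    # Step 2: checkpoint at stable=35; pinned id 30 > M is an assumption.
    second = select_update(chain, 35, 30, clear_on_skip)
    return [u.kind if u else None for u in (first, second)]


print(run_two_steps(clear_on_skip=False))  # ['tombstone', 'tombstone']
print(run_two_steps(clear_on_skip=True))   # [None, 'prepare']
```

Without the clear, step 1 flags the tombstone and step 2 picks it again. With the clear, step 1 selects nothing and step 2 reaches the prepared update.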
Test
New Python test test/suite/test_prepare46.py (test_rollback_tombstone_wrongly_gets_ds_flag) verifies the two-step failure:
1. At stable=35 (prepare_ts=30 stable, rollback_ts=50 not stable): rec_time_window_prepared must increment — the preserved prepared cell was correctly written. Without the fix this stat does not increment.
2. At stable=55 (rollback_ts=50 now stable): rec_time_window_prepared must not increment — the tombstone is cleanly written and no prepared cell is needed.
3. Reads at ts=20 resolve correctly through the history store throughout.
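The stat expectations in items 1 and 2 amount to a window check on the stable timestamp. The helper below is a hypothetical sketch of that rule, not code from test_prepare46.py. Its name, the inclusive treatment of the boundaries, and the zero result before the prepare timestamp is stable are assumptions.

```python
def expected_prepared_cell_delta(stable_ts, prepare_ts=30, rollback_ts=50):
    """Expected change in rec_time_window_prepared for one reconciliation.

    Assumes a timestamp counts as stable once it is <= stable_ts.
    """
    prepare_stable = prepare_ts <= stable_ts
    rollback_stable = rollback_ts <= stable_ts
    # A preserved prepared cell is expected only while the prepare is
    # stable and the rollback is not.
    return 1 if prepare_stable and not rollback_stable else 0


for stable in (25, 35, 55):
    print(stable, expected_prepared_cell_delta(stable))
```

With the ticket's timestamps, the helper expects an increment at stable=35 and none at stable=55.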