Fix reconciliation incorrectly selecting prepare rollback tombstone when rollback timestamp is not yet stable

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • WT12.0.0
    • Affects Version/s: None
    • Component/s: Reconciliation
    • Storage Engines - Transactions
    • 32.913
    • SE Transactions - 2026-05-08
    • 3

      Problem

      In __rec_upd_select (src/reconcile/rec_visibility.c), when preserve_prepared=true and an aborted prepared insert is evicted before its prepare timestamp becomes stable, the rollback tombstone is incorrectly written to disk and marked with WT_UPDATE_DS. The stale flag then causes the next reconciliation, which runs once the prepare timestamp is stable, to skip the preserved prepared cell entirely and write the tombstone again.

      Update chain triggering the bug

      Key 2 — fresh insert, no prior committed value:
        [tombstone (WT_UPDATE_PREPARE_ROLLBACK)]
          -> [aborted_prepare (prepare_ts=30, rollback_ts=50)]
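
As a rough illustration, the chain above can be modelled as a newest-first linked list. This is a toy Python sketch, not WiredTiger's WT_UPDATE structure: the `Update` class, its field names, and the `PREPARE_ROLLBACK` bit value are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical bit value standing in for WT_UPDATE_PREPARE_ROLLBACK.
PREPARE_ROLLBACK = 0x1


@dataclass
class Update:
    """Toy stand-in for one entry of an update chain (newest first)."""
    kind: str
    flags: int = 0
    prepare_ts: Optional[int] = None
    rollback_ts: Optional[int] = None
    older: Optional["Update"] = None


def build_key2_chain() -> Update:
    """Build the chain for key 2: rollback tombstone -> aborted prepare."""
    aborted = Update("aborted_prepare", prepare_ts=30, rollback_ts=50)
    return Update("tombstone", flags=PREPARE_ROLLBACK, older=aborted)


def chain_kinds(head: Optional[Update]) -> List[str]:
    """Walk the chain from newest to oldest and list the update kinds."""
    kinds = []
    while head is not None:
        kinds.append(head.kind)
        head = head.older
    return kinds
```

The chain ends at the aborted prepare because key 2 is a fresh insert with no prior committed value.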
      

      Two-step failure

      Step 1 — first eviction at stable=25 (prepare_ts=30 NOT yet stable):

      The eviction session (session_evict) begins its transaction (txnid N) before session_prep is assigned txnid M, so N < M. With precise_checkpoint=true, eviction uses WT_REC_VISIBLE_NO_SNAPSHOT and rec_start_pinned_id = last_running = N. The skip block condition rec_start_pinned_id (N) <= upd_saved_txnid (M) is satisfied, so the aborted prepare is skipped via continue.

      Before the fix: prepare_rollback_tombstone is not cleared inside the skip block. The post-loop fallback at line 1042 therefore selects the tombstone as upd_select->upd, and F_SET(upd_select->upd, WT_UPDATE_DS) marks it with WT_UPDATE_DS (which subsumes WT_UPDATE_SELECT_FOR_DS).
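
The transaction-id comparison in step 1 can be sketched as a small predicate. This is a simplified Python model of only the one condition quoted above; the function names and the concrete ids are hypothetical, and the real skip block checks more than this.

```python
def is_skipped(rec_start_pinned_id: int, upd_saved_txnid: int) -> bool:
    """Toy form of the quoted condition: rec_start_pinned_id <= upd_saved_txnid."""
    return rec_start_pinned_id <= upd_saved_txnid


def step1_skips_prepare(evict_txnid: int, prepare_txnid: int) -> bool:
    """Model step 1: the pinned id is the eviction session's id N (last_running)."""
    rec_start_pinned_id = evict_txnid  # N, per the ticket's description
    return is_skipped(rec_start_pinned_id, prepare_txnid)
```

With any ids where N < M (for example N=10, M=11), the predicate is true and the aborted prepare is skipped, which is the situation that leaves prepare_rollback_tombstone set.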

      Step 2 — second reconciliation (checkpoint) at stable=35 (prepare_ts=30 now stable, rollback_ts=50 not stable):

      The tombstone now has WT_UPDATE_DS set. This causes two failures:
      1. The skip block condition !F_ISSET(upd, WT_UPDATE_SELECT_FOR_DS) is FALSE → the skip block is bypassed and the tombstone is not skipped.
      2. The prepare_rollback_tombstone assignment at lines 982–984 also checks !F_ISSET(upd, WT_UPDATE_SELECT_FOR_DS) → prepare_rollback_tombstone is never set.

      The tombstone is selected directly as upd_select->upd. The preserved prepared cell is never written: rec_time_window_prepared is not incremented, and the wrong data (tombstone) is written to disk instead of the prepared cell.
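
Both failures in step 2 come from the same flag test. The sketch below models it in Python with made-up bit values; the only property taken from the ticket is that WT_UPDATE_DS subsumes WT_UPDATE_SELECT_FOR_DS. The helper names `f_isset`, `f_set`, and `guard_passes` are hypothetical stand-ins, not WiredTiger macros.

```python
# Hypothetical bit layout. Only the "DS includes SELECT_FOR_DS" relationship
# comes from the ticket; the actual values in WiredTiger may differ.
WT_UPDATE_SELECT_FOR_DS = 0x1
WT_UPDATE_DS = 0x2 | WT_UPDATE_SELECT_FOR_DS
WT_UPDATE_PREPARE_ROLLBACK = 0x4


def f_isset(flags: int, mask: int) -> bool:
    """True if any bit of mask is set in flags."""
    return (flags & mask) != 0


def f_set(flags: int, mask: int) -> int:
    """Return flags with every bit of mask set."""
    return flags | mask


def guard_passes(flags: int) -> bool:
    """Toy form of the shared guard !F_ISSET(upd, WT_UPDATE_SELECT_FOR_DS)."""
    return not f_isset(flags, WT_UPDATE_SELECT_FOR_DS)
```

A tombstone carrying only the rollback flag passes the guard. Once the first eviction has applied WT_UPDATE_DS, the guard fails, so neither the skip block nor the prepare_rollback_tombstone assignment runs.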

      Root cause

      The skip block at lines 823–850 in __rec_upd_select does not clear prepare_rollback_tombstone after skipping a non-visible prepared update. The tombstone set earlier in the same loop iteration remains as the selected update once the loop ends.

      Fix

      Add prepare_rollback_tombstone = NULL; inside the skip block (line 847). This ensures the tombstone is never selected by the post-loop fallback when the prepare is not yet visible, preventing it from acquiring WT_UPDATE_DS and poisoning subsequent reconciliations.
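
The effect of the one-line fix can be shown with a toy selection loop. This Python sketch is a heavy simplification under stated assumptions: the `Upd` class, the bit values, the order of the checks, and the `clear_on_skip` switch are all hypothetical, and the real __rec_upd_select does far more. It only reproduces the three behaviours the ticket describes: remembering the rollback tombstone, skipping the non-visible prepare, and the post-loop fallback.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical bit values; the ticket only states that DS subsumes SELECT_FOR_DS.
PREPARE_ROLLBACK = 0x1
SELECT_FOR_DS = 0x2
DS = 0x4 | SELECT_FOR_DS


@dataclass
class Upd:
    name: str
    txnid: int
    flags: int = 0


def select(chain: List[Upd], pinned_id: int, clear_on_skip: bool) -> Optional[Upd]:
    """Toy selection over a newest-first chain; clear_on_skip models the fix."""
    selected = None
    remembered = None  # stands in for prepare_rollback_tombstone
    for upd in chain:
        if (upd.flags & PREPARE_ROLLBACK) and not (upd.flags & SELECT_FOR_DS):
            remembered = upd  # remember the rollback tombstone
            continue
        if pinned_id <= upd.txnid:
            # Skip block: the prepared update is not visible yet.
            if clear_on_skip:
                remembered = None  # the fix: forget the tombstone
            continue
        selected = upd
        break
    if selected is None and remembered is not None:
        selected = remembered  # post-loop fallback
    if selected is not None:
        selected.flags |= DS  # mark whatever is written to the data store
    return selected


def make_chain(prepare_txnid: int) -> List[Upd]:
    """Rollback tombstone followed by the aborted prepare, same transaction."""
    return [
        Upd("tombstone", prepare_txnid, PREPARE_ROLLBACK),
        Upd("aborted_prepare", prepare_txnid),
    ]
```

With `clear_on_skip=False` the tombstone is returned and picks up the DS bits. With `clear_on_skip=True` nothing is selected in this toy model and the tombstone's flags stay untouched, so a later reconciliation still sees a clean tombstone.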

      Test

      New Python test test/suite/test_prepare46.py (test_rollback_tombstone_wrongly_gets_ds_flag) verifies the two-step failure:
      1. At stable=35 (prepare_ts=30 stable, rollback_ts=50 not stable): rec_time_window_prepared must increment — the preserved prepared cell was correctly written. Without the fix this stat does not increment.
      2. At stable=55 (rollback_ts=50 now stable): rec_time_window_prepared must not increment — the tombstone is cleanly written and no prepared cell is needed.
      3. Reads at ts=20 resolve correctly through the history store throughout.
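
The stat expectations in points 1 and 2 can be summarised as a small predicate. This is an illustrative Python sketch, not code from test_prepare46.py: the function name is hypothetical, and treating the prepare timestamp as inclusive and the rollback timestamp as exclusive is an assumption that the three stable values used here (25, 35, 55) do not exercise.

```python
def prepared_cell_expected(stable_ts: int, prepare_ts: int = 30, rollback_ts: int = 50) -> bool:
    """Whether rec_time_window_prepared is expected to increment at stable_ts.

    A preserved prepared cell is expected only while the prepare timestamp
    is stable and the rollback timestamp is not (boundary handling assumed).
    """
    return prepare_ts <= stable_ts < rollback_ts


# Expected outcome at each stable timestamp mentioned in the ticket.
EXPECTATIONS = {ts: prepared_cell_expected(ts) for ts in (25, 35, 55)}
```

Here `EXPECTATIONS` maps 35 to an expected increment and both 25 and 55 to no increment, matching the checks listed above.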

            Assignee:
            Chenhao Qu
            Reporter:
            Chenhao Qu