Eviction reconciliation asserts stop_txn >= start_txn on disagg stable btree during format-stress-test-disagg-switch


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Not Applicable
    • Storage Engines - Foundations
    • 118.965
    • SE Persistence backlog
    • None

      Symptom

      While trying to reproduce WT-17304 with its config (format-stress-test-disagg-switch: disagg.mode=switch, ops.prepare=1, precise_checkpoint=1, preserve_prepared=1, runs.source=layered, disagg.multi=0), an eviction worker thread aborts on the following assertion:

      src/reconcile/rec_visibility.c:497
          WT_ASSERT(session, select_tw->stop_txn >= select_tw->start_txn);
      

      Stack:

      #3 __timestamp_no_ts_fix         rec_visibility.c:497
      #4 __wti_rec_upd_select          rec_visibility.c:1657
      #5 __wti_rec_row_leaf            rec_row.c:1138
      #6 __reconcile                   rec_write.c:316
      #7 __wt_reconcile                rec_write.c:127
      #8 __evict_reconcile             evict_page.c:1277
      #9 __wt_evict                    evict_page.c:443
      #10 __wti_evict_page             evict_dispatch.c:254
      #11 __wti_evict_lru_pages        evict_queue.c:140
      #12 __evict_thread_run           evict_thread.c:117
      #13 __thread_run                 thread_group.c:32
      

      The faulting page is on a layered table's stable constituent: session->dhandle->name = "file:T00001.wt_stable".

      State at the assertion

      The selected time window for the row about to be reconciled has its txn ids inverted while its timestamps are correctly ordered:

      (gdb) p *select_tw
      $3 = {
        durable_start_ts = 3187849,
        start_ts        = 3187849,
        start_prepare_ts = 0,
        start_txn       = 1599698,
        start_prepared_id = 0,
        durable_stop_ts = 3188959,
        stop_ts         = 3188959,
        stop_prepare_ts = 0,
        stop_txn        = 1599460,
        stop_prepared_id = 0
      }
      
      • stop_ts (3188959) > start_ts (3187849) → timestamps consistent.
      • stop_txn (1599460) < start_txn (1599698) → assertion fails.
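
      The two checks above can be replayed outside the engine. The following is a minimal sketch, not WiredTiger code: struct tw_sketch and the helper names are hypothetical, and only the four fields relevant to the assertion are modeled, with values copied from the gdb dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the dumped time window (not the WiredTiger type). */
struct tw_sketch {
    uint64_t start_ts, start_txn;
    uint64_t stop_ts, stop_txn;
};

/* True when the stop timestamp is not earlier than the start timestamp. */
static bool
tw_ts_ordered(const struct tw_sketch *tw)
{
    assert(tw != NULL);
    return tw->stop_ts >= tw->start_ts;
}

/* True when the condition asserted at rec_visibility.c:497 holds. */
static bool
tw_txn_ordered(const struct tw_sketch *tw)
{
    assert(tw != NULL);
    return tw->stop_txn >= tw->start_txn;
}

/* Time window populated with the values from the gdb dump. */
static struct tw_sketch
tw_from_dump(void)
{
    struct tw_sketch tw;

    tw.start_ts = 3187849;
    tw.start_txn = 1599698;
    tw.stop_ts = 3188959;
    tw.stop_txn = 1599460;
    return tw;
}
```

      For the dumped values, tw_ts_ordered returns true and tw_txn_ordered returns false, which matches the two bullets above: the window is consistent by timestamp but not by transaction id.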

      Update chain that produced it

      For the same key, the in-memory update chain visited by __wti_rec_upd_select is:

      WT_ROW_UPDATE(page, rip) = 0x138e1a85ddc0   /* head: tombstone */
        txnid        = 1599460
        durable_ts   = 3188959, start_ts = 3188959
        prepared_id  = 0, prepare_ts = 0
        prepare_state = 0
        type         = 4   /* WT_UPDATE_TOMBSTONE */
        flags        = 0x800   /* WT_UPDATE_RESTORED_FROM_INGEST */
        next         -> 0x138e08aac7d0
      
      next:
        txnid        = 1599698
        durable_ts   = 3187849, start_ts = 3187849
        prepared_id  = 0, prepare_ts = 0
        prepare_state = 0
        type         = 3   /* WT_UPDATE_STANDARD */
        flags        = 0x204   /* WT_UPDATE_DS | WT_UPDATE_RESTORED_FROM_DS */
      

      Notes:

      • The head update is a tombstone flagged WT_UPDATE_RESTORED_FROM_INGEST — it was created during the ingest → stable drain on step-up, carrying the ingest btree's transaction id verbatim.
      • The underlying value is flagged WT_UPDATE_DS | WT_UPDATE_RESTORED_FROM_DS — it was instantiated from the stable's on-disk cell when the page was read in.
      • Neither update is prepared; both have prepare_state=0, prepared_id=0, prepare_ts=0.
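
      To make the link between the chain and the dumped time window explicit, here is a simplified sketch of how a tombstone stacked on a value yields a window whose stop side comes from the tombstone and whose start side comes from the value. The types and the function tw_from_chain are hypothetical; the real selection in __wti_rec_upd_select also handles visibility, prepared updates and on-disk cells, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced update: only the fields needed to build a time window. */
struct upd_sketch {
    uint64_t txnid;
    uint64_t start_ts;
    bool tombstone;
    const struct upd_sketch *next;
};

/* Hypothetical, reduced time window. */
struct tw_pair {
    uint64_t start_ts, start_txn;
    uint64_t stop_ts, stop_txn;
};

/*
 * Build a window from a chain head. Assumption for this sketch: a tombstone at
 * the head supplies the stop side, and the update beneath it supplies the
 * start side. A chain without a tombstone leaves the stop side at its maximum.
 */
static struct tw_pair
tw_from_chain(const struct upd_sketch *head)
{
    struct tw_pair tw = {0, 0, UINT64_MAX, UINT64_MAX};
    const struct upd_sketch *value = head;

    assert(head != NULL);
    if (head->tombstone) {
        tw.stop_ts = head->start_ts;
        tw.stop_txn = head->txnid;
        value = head->next;
    }
    if (value != NULL) {
        tw.start_ts = value->start_ts;
        tw.start_txn = value->txnid;
    }
    return tw;
}
```

      Feeding in the two updates shown above (tombstone txnid 1599460 at timestamp 3188959, over a value with txnid 1599698 at timestamp 3187849) gives a window with ordered timestamps and inverted txn ids, the same shape as the dumped select_tw.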

      Additional observations

      • S2BT(session)->base_write_gen = 19 for this btree, whereas the page header reports a much larger write_gen: ((WT_PAGE_HEADER *)page->dsk)->write_gen = 667.
      • Because dsk_write_gen > base_write_gen, the existing on-disk-cell txn-id cleanup (__cell_unpack_window_need_cleanup / __cell_kv_window_cleanup) is skipped for this page; the RESTORED_FROM_DS update therefore retains whatever start_txn was stored on disk.
      • The tombstone's txnid=1599460 appears to be a real current-run ingest txn id, not a leaked one — i.e. the moved-from-ingest update legitimately preserves its txn id.
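
      The write-generation observation can be illustrated with a small sketch. This is not the WiredTiger implementation: the function names are hypothetical, the treatment of equal write generations is an assumption, and so is the idea that cleanup resets the txn id to 0. The only behavior taken from this ticket is that cleanup is skipped when dsk_write_gen > base_write_gen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: does the on-disk-cell txn-id cleanup apply?
 * From the ticket: it is skipped when the page's write generation is greater
 * than the btree's base write generation. Treating equal values as "cleanup
 * applies" is an assumption of this sketch.
 */
static bool
cell_txn_cleanup_applies(uint64_t dsk_write_gen, uint64_t base_write_gen)
{
    return dsk_write_gen <= base_write_gen;
}

/*
 * Start txn id carried by a value instantiated from an on-disk cell.
 * Assumption: when cleanup applies, the txn id is reset to 0; otherwise the
 * on-disk value is kept verbatim.
 */
static uint64_t
restored_start_txn(uint64_t on_disk_start_txn, uint64_t dsk_write_gen, uint64_t base_write_gen)
{
    if (cell_txn_cleanup_applies(dsk_write_gen, base_write_gen))
        return 0;
    return on_disk_start_txn;
}
```

      With the observed values (dsk_write_gen = 667, base_write_gen = 19) the predicate is false, so the restored value keeps start_txn = 1599698. Had the page been classified as older than the base write generation, the sketch would return 0 and the inversion against the tombstone's txnid would not arise.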

      So the two halves of the chain were produced by different code paths that did not coordinate on the txn-id namespace, leading to the stop_txn < start_txn state that the assertion forbids.

      Reproduction

      • Reproduces against the WT-17304 task configuration on aarch64.
      • Reproducer: format-stress-test-disagg-switch with disagg.mode=switch, ops.prepare=1, precise_checkpoint=1, preserve_prepared=1, runs.source=layered, disagg.multi=0.

      Open questions / what's not yet established

      • Whether the root cause is:
        • (a) the ingest-drain code path putting a tombstone with a non-comparable txn id onto a stable chain;
        • (b) the RESTORED_FROM_DS value retaining a stale on-disk start_txn that was not cleared because the page header reports a current-run write_gen while its base image is from an earlier epoch;
        • (c) the assertion itself being too strict for the layered-table case; or
        • some combination.
      • Whether this is the same root cause as WT-17304 (the discover-walk-skip / lost prepare-bit story) or a sibling bug.

      The base_write_gen=19 vs dsk_write_gen=667 mismatch is suggestive but has not been proven to be the cause, which is why it is listed under "Additional observations" rather than as a root cause.

      Related

      • WT-17304: format-stress-test-disagg-switch timed out with prepare-conflict. This issue surfaced while reproducing it.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Chenhao Qu