Track the true max transaction id seen on a page during reconciliation, including invisible/newer updates

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Transactions
    • Storage Engines - Transactions
    • 97.829

      Problem

      mod->rec_max_txn (src/include/btmem.h:384) is populated from r->max_txn, which the update-selection loops in src/reconcile/rec_visibility.c only advance for updates that reconciliation actually selects/considers visible:

      • __rec_upd_select() (rec_visibility.c:722-1008): updates that are aborted-but-kept, not yet globally visible, not yet stable (precise checkpoint), unresolved/rolled-back prepared, or that belong to the evicting session's own transaction are skipped via continue without updating max_txn. They only set *has_newer_updatesp = true (e.g. lines 814-831, 854-871, 903-932). Only updates that clear those checks reach if (max_txn < txnid) max_txn = txnid; (lines 976-977).
      • __rec_upd_select_inmem() (rec_visibility.c:1017-1239) has the identical pattern (skip-without-tracking at lines 1092-1109, track only at lines 1116-1117).
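
      The pattern described in the two bullets above can be modelled with a minimal sketch. Everything here is a hypothetical stand-in: TOY_UPDATE, its selectable flag, and toy_select_max_txn() are illustrative names, not WiredTiger code, and the single flag collapses all of the real skip conditions into one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one update on a chain (not WT_UPDATE). */
typedef struct toy_update {
    uint64_t txnid;
    bool selectable;         /* false models any "skipped via continue" case */
    struct toy_update *next; /* next older update on the chain */
} TOY_UPDATE;

/*
 * Walk the chain newest to oldest. Skipped updates only set the boolean;
 * only updates that pass the gate can advance max_txn.
 */
static uint64_t
toy_select_max_txn(const TOY_UPDATE *head, bool *has_newer_updatesp)
{
    const TOY_UPDATE *upd;
    uint64_t max_txn = 0;

    assert(has_newer_updatesp != NULL);
    *has_newer_updatesp = false;

    for (upd = head; upd != NULL; upd = upd->next) {
        if (!upd->selectable) {
            *has_newer_updatesp = true;
            continue; /* txnid is never compared against max_txn */
        }
        if (max_txn < upd->txnid)
            max_txn = upd->txnid;
    }
    return (max_txn);
}
```

      In this model, a chain holding a skipped update with txn id 25 on top of a selectable update with txn id 10 yields a max of 10: the 25 survives only as a true/false flag.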

      So rec_max_txn answers "what is the newest transaction id that reconciliation could see and select," not "what is the newest transaction id that has ever touched this page." Any update with a higher txn id that was skipped at reconciliation time (not yet committed, stable, or resolved) is not reflected in rec_max_txn. It is recorded only indirectly through the has_newer_updatesp boolean, which feeds page-dirty tracking (first_dirty_txn) rather than producing a comparable txn id.

      There is a separate, related field, mod->update_txn (btmem.h:404), which every update applicator sets unconditionally through __wt_page_modify_set(), regardless of visibility (src/include/btree_inline.h:1011-1017). But its own comment describes it as fuzzy, i.e. a heuristic ("can be a little fuzzy, otherwise this would need to be a compare and swap"), it is never reset, and it is currently consumed only by unrelated heuristics (bt_sync.c:226, evict_walk.c:188,735,737). It is therefore not a precise, reconciliation-scoped record of the true max txn id observed in the update chain at the time of a given reconciliation.

      Why this matters

      WT-17949 proposes letting checkpoint skip re-reconciling a page that eviction already reconciled at a matching pinned stable timestamp, when rec_max_txn <= snap_min. As currently defined, rec_max_txn only reflects the visible/selected portion of the update chain, so it cannot by itself prove that no update with a txn id between rec_max_txn and snap_min exists on the page (e.g. an update that was invisible/unstable when eviction reconciled the page, but is now visible to the checkpoint snapshot). Relying on rec_max_txn alone for that decision is unsound.
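
      A small sketch of the gap, under stated assumptions: toy_skip_check() is a hypothetical rendering of the rec_max_txn <= snap_min test only (it ignores the pinned stable timestamp condition), and the txn ids are made-up values chosen for illustration, not taken from WT-17949.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical form of the candidate skip test: rec_max_txn <= snap_min. */
static bool
toy_skip_check(uint64_t rec_max_txn, uint64_t snap_min)
{
    return (rec_max_txn <= snap_min);
}

/* True if txnid falls in the range the skip test cannot rule out. */
static bool
toy_in_unproven_range(uint64_t txnid, uint64_t rec_max_txn, uint64_t snap_min)
{
    return (txnid > rec_max_txn && txnid < snap_min);
}

/*
 * Illustrative scenario: eviction selected txn 10 and skipped txn 25;
 * the checkpoint later runs with snap_min 30. Returns true when the skip
 * test passes even though a skipped update sits in the unproven range.
 */
static bool
toy_demo_unsound_skip(void)
{
    const uint64_t rec_max_txn = 10, skipped_txnid = 25, snap_min = 30;

    assert(rec_max_txn < skipped_txnid);
    return (toy_skip_check(rec_max_txn, snap_min) &&
      toy_in_unproven_range(skipped_txnid, rec_max_txn, snap_min));
}
```

      With these values the skip test is satisfied although the page carries an update that rec_max_txn never recorded, which is the situation described above.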

      Proposal

      Add a new field (e.g. rec_max_txn_all or rec_max_txn_unvisited) populated during the same reconciliation walk in __rec_upd_select() and __rec_upd_select_inmem(). It would track the maximum txn id seen across every update encountered in the chain (visible or not, excluding only truly aborted/discarded updates), independently of the existing visibility-gated max_txn. Store it on WT_PAGE_MODIFY alongside rec_max_txn so that checkpoint-skip logic (and any other consumer that needs a sound upper bound) can use it instead of the racy, unscoped update_txn heuristic.
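
      One possible shape for the dual tracking, as a sketch only. TOY_UPD, TOY_STATE, TOY_PAGE_MODIFY and toy_track_max_txn() are hypothetical stand-ins; rec_max_txn_all is the candidate name suggested above, and the three-state enum is an assumption that simplifies the real skip conditions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of an update during the walk. */
typedef enum { TOY_SELECTABLE, TOY_SKIPPED, TOY_ABORTED } TOY_STATE;

typedef struct toy_upd {
    uint64_t txnid;
    TOY_STATE state;
    struct toy_upd *next; /* next older update on the chain */
} TOY_UPD;

/* Hypothetical stand-in for the two fields on WT_PAGE_MODIFY. */
typedef struct {
    uint64_t rec_max_txn;     /* existing: visibility-gated maximum */
    uint64_t rec_max_txn_all; /* proposed: maximum over all live updates */
} TOY_PAGE_MODIFY;

static void
toy_track_max_txn(const TOY_UPD *head, TOY_PAGE_MODIFY *mod)
{
    const TOY_UPD *upd;
    uint64_t max_txn = 0, max_txn_all = 0;

    assert(mod != NULL);
    for (upd = head; upd != NULL; upd = upd->next) {
        /* Truly aborted updates count toward neither maximum. */
        if (upd->state == TOY_ABORTED)
            continue;

        /* Track every remaining update before any visibility gate. */
        if (max_txn_all < upd->txnid)
            max_txn_all = upd->txnid;

        /* The existing gate: skipped updates do not advance max_txn. */
        if (upd->state == TOY_SKIPPED)
            continue;
        if (max_txn < upd->txnid)
            max_txn = upd->txnid;
    }
    mod->rec_max_txn = max_txn;
    mod->rec_max_txn_all = max_txn_all;
}
```

      The only design point the sketch is meant to show is ordering: the new maximum is updated before the visibility checks can continue past an update, so both values come out of a single walk.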

      Relates to WT-17949.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Chenhao Qu
