Ingest btree reconciliation asserts on non-prunable fallback update when WT_CONN_CLOSING makes all updates globally visible


      Summary

      During connection close, WT_CONN_CLOSING causes __wt_txn_visible_all to return true unconditionally, bypassing the disaggregated pinned-timestamp cap (last_checkpoint_timestamp). As a result, the ingest btree update selection loop in __rec_upd_select_inmem breaks early at a committed preserved prepared update that appears globally visible, even though the per-btree prune threshold (rec_prune_timestamp) is 0 because no checkpoint has been picked up.

      The fallback block introduced in WT-17684 then walks to the next update in the chain, which is also a committed preserved prepared update from a different transaction. That update is not prunable (rec_prune_timestamp == WT_TS_NONE), violating the assertion:

      WT_ASSERT(session, WT_REC_CAN_PRUNE_UPD(fallback->txnid, fallback->upd_durable_ts, r));
      

      Reproduction Scenario

      Ingest btree update chain (newest to oldest) for a key on a follower that has not picked up any checkpoint:

      • T2 (txnid=721716, durable_ts=771084, prepared_id=78786) — newer committed preserved prepared update
      • T1 (txnid=336389, durable_ts=364469, prepared_id=34934) — older committed preserved prepared update; next=NULL

      At connection close with WT_CONN_CLOSING set (flags_atomic=0x8C):

      1. __wt_txn_visible_all returns true for T2 (bypasses last_checkpoint_timestamp=0 cap)
      2. Old selection loop breaks at T2; upd_select->upd = T2
      3. Fallback block fires (T2 has prepared_id != WT_PREPARED_ID_NONE)
      4. T1 found as fallback; T1->txnid ≠ T2->txnid — first assert passes
      5. WT_REC_CAN_PRUNE_UPD(T1.txnid, T1.durable_ts=364469, r) with rec_prune_timestamp=0 → false — second assert fires
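
      The five steps above can be replayed on a toy model. The sketch below is not WiredTiger code: every model_ name is hypothetical, global visibility is reduced to "closing, or durable_ts at or below the checkpoint timestamp", and the prune check looks only at the timestamp (the real WT_REC_CAN_PRUNE_UPD also takes the transaction id). Under those assumptions it shows the old loop stopping at T2 and the fallback check failing on T1 only when the closing flag is set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TS_NONE 0u
#define MODEL_PREPARED_ID_NONE 0u

/* Hypothetical, heavily simplified stand-in for an update on the chain. */
typedef struct model_upd {
    uint64_t txnid;
    uint64_t durable_ts;
    uint64_t prepared_id;
    struct model_upd *next; /* next older update, or NULL */
} model_upd;

/* Assumed visibility rule: closing short-circuits; otherwise cap by the checkpoint timestamp. */
static bool
model_visible_all(bool conn_closing, uint64_t last_checkpoint_ts, const model_upd *upd)
{
    if (conn_closing)
        return (true);
    return (upd->durable_ts <= last_checkpoint_ts);
}

/* Assumed prune rule: nothing is prunable while the threshold is unset. */
static bool
model_can_prune(uint64_t rec_prune_ts, const model_upd *upd)
{
    return (rec_prune_ts != MODEL_TS_NONE && upd->durable_ts <= rec_prune_ts);
}

/* Old selection loop: stop at the first update that looks globally visible. */
static const model_upd *
model_old_select(bool conn_closing, uint64_t last_checkpoint_ts, const model_upd *head)
{
    const model_upd *upd, *last;

    last = NULL;
    for (upd = head; upd != NULL; upd = upd->next) {
        last = upd;
        if (model_visible_all(conn_closing, last_checkpoint_ts, upd))
            break;
    }
    return (last);
}

/* Returns true when both fallback assertions would hold for the selected update. */
static bool
model_fallback_ok(uint64_t rec_prune_ts, const model_upd *selected)
{
    const model_upd *fallback;

    if (selected == NULL || selected->prepared_id == MODEL_PREPARED_ID_NONE)
        return (true); /* Fallback block does not fire. */
    if ((fallback = selected->next) == NULL)
        return (true); /* Nothing to fall back to. */
    return (fallback->txnid != selected->txnid && model_can_prune(rec_prune_ts, fallback));
}

/* Build the T2 -> T1 chain from the report and run the old loop plus the fallback check. */
static bool
model_scenario_fallback_ok(bool conn_closing)
{
    model_upd t1 = {336389, 364469, 34934, NULL};
    model_upd t2 = {721716, 771084, 78786, &t1};

    /* last_checkpoint_timestamp = 0 and rec_prune_timestamp = 0, as in the core. */
    return (model_fallback_ok(0, model_old_select(conn_closing, 0, &t2)));
}
```

      In this model, model_scenario_fallback_ok(true) returns false because the loop stops at T2 and T1 is not prunable, while model_scenario_fallback_ok(false) returns true because the loop reaches T1, whose next is NULL.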

      Key GDB values from core:

      • r->rec_prune_timestamp = 0
      • S2C(session)->disaggregated_storage.last_checkpoint_timestamp = 0
      • S2C(session)->layered_table_manager.leader = false
      • conn->flags_atomic = 140 (0x8C) → WT_CONN_CLOSING (0x4) is set
      • txn_global.pinned_timestamp = 1140834 (application-set, but irrelevant — WT_CONN_CLOSING bypasses the cap)
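
      As a quick check of the flag arithmetic, 140 is 0x8C, which is 0x80 | 0x08 | 0x04, so the 0x4 bit is set. The helper below is a hypothetical sketch; the value 0x4 for WT_CONN_CLOSING is taken from this core and may differ in other builds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value reported for WT_CONN_CLOSING in this core; an assumption for any other build. */
#define MODEL_CONN_CLOSING 0x4u

/* Returns true when the closing bit is set in a connection flag word. */
static bool
model_conn_is_closing(uint32_t flags_atomic)
{
    return ((flags_atomic & MODEL_CONN_CLOSING) != 0);
}
```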

      Fix

      In __rec_upd_select_inmem, do not use the global visibility check to terminate the update selection loop on ingest btrees for non-tombstone updates. The global visibility check is unsafe on ingest btrees because it can be bypassed (e.g. by WT_CONN_CLOSING) and the per-btree prune threshold is the correct gate. Tombstones on ingest btrees are always non-timestamped and are handled unconditionally.

      Before:

      if ((!F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT) || upd->type != WT_UPDATE_TOMBSTONE ||
            upd->upd_durable_ts == WT_TS_NONE) &&
        __wt_txn_upd_visible_all(session, upd)) {
          found_last_upd_to_keep = true;
          break;
      }
      

      After:

      if (F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT) && upd->type == WT_UPDATE_TOMBSTONE) {
          WT_ASSERT(session, upd->upd_durable_ts == WT_TS_NONE);
          found_last_upd_to_keep = true;
          break;
      } else if (!F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT) &&
        __wt_txn_upd_visible_all(session, upd)) {
          found_last_upd_to_keep = true;
          break;
      }
      

      With this change, the loop on ingest btrees no longer stops at a committed update merely because it appears globally visible. It runs past all non-prunable committed updates and selects the oldest of them. That update's next pointer is either NULL or a prunable update from a different transaction, which satisfies both fallback assertions.
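
      The two termination conditions can be compared side by side on a small model. This is a sketch under stated assumptions, not the actual reconciliation code: the model_ names are hypothetical, the garbage_collect argument stands in for F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT), and the visible_all field stands in for the result of __wt_txn_upd_visible_all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TS_NONE 0u

typedef enum { MODEL_UPD_STANDARD, MODEL_UPD_TOMBSTONE } model_upd_type;

/* Hypothetical stand-in for an update; visible_all replaces the real visibility call. */
typedef struct model_upd {
    model_upd_type type;
    uint64_t durable_ts;
    bool visible_all;
    struct model_upd *next; /* next older update, or NULL */
} model_upd;

/* Termination condition before the fix, transcribed from the "Before" snippet. */
static bool
model_stop_before(bool garbage_collect, const model_upd *upd)
{
    return ((!garbage_collect || upd->type != MODEL_UPD_TOMBSTONE ||
              upd->durable_ts == MODEL_TS_NONE) &&
      upd->visible_all);
}

/* Termination condition after the fix, transcribed from the "After" snippet. */
static bool
model_stop_after(bool garbage_collect, const model_upd *upd)
{
    if (garbage_collect && upd->type == MODEL_UPD_TOMBSTONE) {
        assert(upd->durable_ts == MODEL_TS_NONE);
        return (true);
    }
    return (!garbage_collect && upd->visible_all);
}

/* Walk the chain and return the last update visited before the loop terminates. */
static const model_upd *
model_select(bool garbage_collect, bool use_fix, const model_upd *head)
{
    const model_upd *upd, *last;

    last = NULL;
    for (upd = head; upd != NULL; upd = upd->next) {
        last = upd;
        if (use_fix ? model_stop_after(garbage_collect, upd) :
                      model_stop_before(garbage_collect, upd))
            break;
    }
    return (last);
}

/* T2 -> T1 on an ingest btree at close: both updates look globally visible. */
static bool
model_scenario_selects_oldest(bool use_fix)
{
    model_upd t1 = {MODEL_UPD_STANDARD, 364469, true, NULL};
    model_upd t2 = {MODEL_UPD_STANDARD, 771084, true, &t1};

    return (model_select(true, use_fix, &t2) == &t1);
}
```

      In this model, the old condition stops at T2 (model_scenario_selects_oldest(false) is false), while the new condition walks to T1, whose next is NULL (model_scenario_selects_oldest(true) is true). For btrees without the garbage-collect flag, the two conditions agree.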

            Assignee:
            Chenhao Qu
            Reporter:
            Chenhao Qu