- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Reconciliation
- None
- Storage Engines, Storage Engines - Transactions
- 0.043
- SE Transactions - 2026-06-05
- 3
- 9
Summary
During connection close, WT_CONN_CLOSING causes __wt_txn_visible_all to return true unconditionally, bypassing the disaggregated pinned-timestamp cap (last_checkpoint_timestamp). This causes the ingest btree update selection loop in __rec_upd_select_inmem to break early at a committed preserved prepared update that appears globally visible, even though the per-btree prune threshold (rec_prune_timestamp) is 0 because no checkpoint has been picked up.
The fallback block introduced in WT-17684 then walks to the next update in the chain, which is also a committed preserved prepared update from a different transaction. That update is not prunable (rec_prune_timestamp == WT_TS_NONE), violating the assertion:
WT_ASSERT(session, WT_REC_CAN_PRUNE_UPD(fallback->txnid, fallback->upd_durable_ts, r));
Reproduction Scenario
Ingest btree update chain (newest to oldest) for a key on a follower that has not picked up any checkpoint:
- T2 (txnid=721716, durable_ts=771084, prepared_id=78786) — newer committed preserved prepared update
- T1 (txnid=336389, durable_ts=364469, prepared_id=34934) — older committed preserved prepared update; next=NULL
At connection close with WT_CONN_CLOSING set (flags_atomic=0x8C):
- __wt_txn_visible_all returns true for T2 (bypasses last_checkpoint_timestamp=0 cap)
- Old selection loop breaks at T2; upd_select->upd = T2
- Fallback block fires (T2 has prepared_id != WT_PREPARED_ID_NONE)
- T1 found as fallback; T1->txnid ≠ T2->txnid — first assert passes
- WT_REC_CAN_PRUNE_UPD(T1.txnid, T1.durable_ts=364469, r) with rec_prune_timestamp=0 → false — second assert fires
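The sequence above can be replayed with a small standalone model. This is only a sketch: the struct and the model_* functions are hypothetical stand-ins, not WiredTiger code, and the bodies of the visibility and prune checks are assumptions inferred from this ticket (the real WT_REC_CAN_PRUNE_UPD also takes a txnid, which the model ignores).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; none of these are the real WiredTiger definitions. */
#define MODEL_TS_NONE 0u

struct model_upd {
    uint64_t txnid;
    uint64_t durable_ts;
    uint64_t prepared_id;   /* 0 means "not prepared" in this model */
    struct model_upd *next; /* next older update, or NULL */
};

/* Assumed shape of the global check: closing short-circuits the timestamp cap. */
static bool
model_visible_all(bool conn_closing, uint64_t cap_ts, const struct model_upd *upd)
{
    if (conn_closing)
        return true;
    return upd->durable_ts <= cap_ts;
}

/* Assumed shape of the prune gate: nothing is prunable until a threshold exists. */
static bool
model_can_prune(uint64_t rec_prune_ts, const struct model_upd *upd)
{
    return rec_prune_ts != MODEL_TS_NONE && upd->durable_ts <= rec_prune_ts;
}

/*
 * Mirrors the two fallback assertions described in the ticket.
 * Returns 0 if both hold, otherwise the number of the first check that fails.
 */
static int
model_fallback_check(const struct model_upd *selected, uint64_t rec_prune_ts)
{
    const struct model_upd *fallback = selected->next;

    assert(fallback != NULL);
    if (fallback->txnid == selected->txnid)
        return 1; /* first assert: fallback must come from another transaction */
    if (!model_can_prune(rec_prune_ts, fallback))
        return 2; /* second assert: fallback must be prunable */
    return 0;
}
```

Building T1 and T2 with the values from the ticket and calling model_fallback_check on T2 with a prune threshold of 0 reports the second check as the failing one, which matches the core.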
Key GDB values from core:
- r->rec_prune_timestamp = 0
- S2C(session)->disaggregated_storage.last_checkpoint_timestamp = 0
- S2C(session)->layered_table_manager.leader = false
- conn->flags_atomic = 140 (0x8C) → WT_CONN_CLOSING (0x4) is set
- txn_global.pinned_timestamp = 1140834 (application-set, but irrelevant — WT_CONN_CLOSING bypasses the cap)
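As a quick check of the flag arithmetic, 140 is 0x8C, which is 0x80 | 0x08 | 0x04, so the 0x4 bit is set. A minimal sketch, assuming the 0x4 value quoted in this ticket (TICKET_CONN_CLOSING and flag_is_set are names made up for the example; the real header may define the flag differently):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the ticket text, not from the WiredTiger headers. */
#define TICKET_CONN_CLOSING 0x4u

/* Returns true if any bit of mask is set in flags. */
static bool
flag_is_set(uint32_t flags, uint32_t mask)
{
    assert(mask != 0); /* an empty mask would make the question meaningless */
    return (flags & mask) != 0;
}
```

With flags = 140, flag_is_set(flags, TICKET_CONN_CLOSING) is true.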
Fix
In __rec_upd_select_inmem, do not use the global visibility check to terminate the update selection loop on ingest btrees for non-tombstone updates. The global visibility check is unsafe on ingest btrees because it can be bypassed (e.g. by WT_CONN_CLOSING) and the per-btree prune threshold is the correct gate. Tombstones on ingest btrees are always non-timestamped and are handled unconditionally.
Before:
    if ((!F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT) || upd->type != WT_UPDATE_TOMBSTONE ||
          upd->upd_durable_ts == WT_TS_NONE) &&
      __wt_txn_upd_visible_all(session, upd)) {
        found_last_upd_to_keep = true;
        break;
    }
After:
    if (F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT) && upd->type == WT_UPDATE_TOMBSTONE) {
        WT_ASSERT(session, upd->upd_durable_ts == WT_TS_NONE);
        found_last_upd_to_keep = true;
        break;
    } else if (!F_ISSET(btree, WT_BTREE_GARBAGE_COLLECT) &&
      __wt_txn_upd_visible_all(session, upd)) {
        found_last_upd_to_keep = true;
        break;
    }
With this change the loop on ingest btrees no longer stops at a non-prunable committed update. It runs on to the true oldest update, whose next is either NULL or a prunable update from a different transaction, which satisfies the fallback assertions.
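The difference between the two stop conditions can be seen in a toy version of the selection loop. This is a hedged sketch rather than the actual reconciliation code: the model_* types and functions are hypothetical, the visibility check is reduced to "closing, or durable timestamp within the cap", and recording the last visited update as the selection is an assumption about the loop's shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_type { MODEL_STANDARD, MODEL_TOMBSTONE };

struct model_upd {
    enum model_type type;
    uint64_t txnid;
    uint64_t durable_ts;    /* 0 stands in for WT_TS_NONE */
    struct model_upd *next; /* next older update, or NULL */
};

struct model_ctx {
    bool ingest;       /* stands in for WT_BTREE_GARBAGE_COLLECT being set */
    bool conn_closing; /* stands in for WT_CONN_CLOSING */
    uint64_t cap_ts;   /* stands in for the pinned-timestamp cap */
};

typedef bool (*model_stop_fn)(const struct model_ctx *, const struct model_upd *);

/* Assumed behavior: closing makes every update look globally visible. */
static bool
model_visible_all(const struct model_ctx *ctx, const struct model_upd *upd)
{
    return ctx->conn_closing || upd->durable_ts <= ctx->cap_ts;
}

/* Stop condition with the same boolean structure as the "Before" snippet. */
static bool
model_stop_before(const struct model_ctx *ctx, const struct model_upd *upd)
{
    return (!ctx->ingest || upd->type != MODEL_TOMBSTONE || upd->durable_ts == 0) &&
      model_visible_all(ctx, upd);
}

/* Stop condition with the same boolean structure as the "After" snippet. */
static bool
model_stop_after(const struct model_ctx *ctx, const struct model_upd *upd)
{
    if (ctx->ingest && upd->type == MODEL_TOMBSTONE) {
        assert(upd->durable_ts == 0);
        return true;
    }
    return !ctx->ingest && model_visible_all(ctx, upd);
}

/* Walk newest to oldest; the last update visited is the one selected. */
static const struct model_upd *
model_select(const struct model_ctx *ctx, const struct model_upd *head, model_stop_fn stop)
{
    const struct model_upd *selected = NULL;

    for (const struct model_upd *upd = head; upd != NULL; upd = upd->next) {
        selected = upd;
        if (stop(ctx, upd))
            break;
    }
    return selected;
}
```

On the T2 -> T1 chain from the reproduction scenario, with ingest and conn_closing both true and cap_ts = 0, model_stop_before selects T2 while model_stop_after walks through to T1.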