- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Reconciliation
- None
- Storage Engines - Transactions
- SE Transactions - 2026-04-24
- 1
Problem
In disaggregated storage, once a block with a given page_id is discarded in the page log, that page_id is dead. The first write after a discard must have backlink_lsn = 0 (a fresh start). Reusing the page_id with a non-zero backlink_lsn references a chain that was already invalidated, causing verify_chain to fail with "Full page backlink_lsn mismatch", returning EINVAL and panicking the eviction thread.
Code Flow
Step 1 – Reconciliation produces MULTIBLOCK with 1 entry, reusing the page_id
A page with page_id X is dirty and has saved updates that can't be written (e.g., an older reader holds a snapshot). Reconciliation produces a single block but goes through the WT_MULTI_SUPD_RESTORE path (rec_write.c:3103-3110):
if (F_ISSET(r, WT_REC_IN_MEMORY) || F_ISSET(r->multi, WT_MULTI_SUPD_RESTORE)) {
    if (page->disagg_info != NULL)
        page->disagg_info->block_meta = *r->multi->block_meta; /* page_id X copied back */
    goto split; /* sets mod->rec_result = WT_PM_REC_MULTIBLOCK */
}
During the write phase (__rec_split_write_image), since r->multi_next == 1 and page->disagg_info->block_meta.page_id != WT_BLOCK_INVALID_PAGE_ID, the page_id X is reused. A block with page_id X is written to the page log (e.g., a delta at lsn=545145). The goto split path sets mod->rec_result = WT_PM_REC_MULTIBLOCK with mod->mod_multi_entries = 1.
Step 2 – Eviction rewrites the page in memory
Eviction handles WT_PM_REC_MULTIBLOCK with mod_multi_entries == 1 by calling __wt_split_rewrite (evict_page.c:676-678). Inside __split_multi_inmem (bt_split.c:1496-1497), the multi block's block_meta (page_id X) is copied into the new in-memory page:
if (page->disagg_info != NULL) {
    page->disagg_info->block_meta = *multi->block_meta; /* page_id X preserved */
The page is now in memory with page_id X, dirty (has saved updates), and mod->rec_result is effectively 0 on the new page.
Step 3 – Next reconciliation: __rec_split_discard discards the block but doesn't invalidate the page_id
The page is reconciled again. In __rec_write_wrapup, the cleanup switch processes the previous mod->rec_result == WT_PM_REC_MULTIBLOCK and calls __rec_split_discard.
If the new reconciliation produces any number of pages other than one (r->multi_next != 1), free_blocks is true. The old block (with page_id X) is discarded via __wt_btree_block_free (rec_write.c:2751-2753).
This sends a discard to the page log for page_id X. But page->disagg_info->block_meta.page_id is never set to WT_BLOCK_INVALID_PAGE_ID. The page_id X remains "valid" in memory.
Step 4 – The new reconciliation reuses the discarded page_id
If the new reconciliation produces a single page, __rec_split_write_image checks page->disagg_info->block_meta.page_id and finds X (still valid). It reuses page_id X with backlink_lsn pointing to the previous full page in the chain. A new full page (lsn=546143) is written with page_id X and backlink_lsn=463197. But page_id X was already discarded in step 3. Palite's verify_chain correctly rejects this.
Evidence
Page log state for (table_id=57, page_id=1965):
lsn=546143, backlink=463197, base=0, delta=0, discarded=0 <- new full page (reused page_id after discard)
lsn=545145, backlink=463197, base=463197, delta=1, discarded=1 <- discarded delta (step 3)
lsn=463197, backlink=461766, base=0, delta=0, discarded=0 <- valid previous full page
The new full page (lsn=546143) has backlink_lsn=463197, but the page_id had already been discarded at lsn=545145. With the fix, the page_id would have been invalidated in step 3, a fresh page_id would have been allocated in step 4, and the write would have carried backlink_lsn=0.
Fix
Changes:
- Invalidate page_id in __rec_split_discard (src/reconcile/rec_write.c): After discarding a block via __wt_btree_block_free, check whether the discarded block's page_id matches page->disagg_info->block_meta.page_id. If so, invalidate it so the next reconciliation allocates a fresh page_id. This is the root cause fix.
- Also hoisted the disagg_page_free_required computation in __rec_write_wrapup before the switch statement to eliminate duplicate calculations across the case 0 and case WT_PM_REC_REPLACE branches.
Impact
Affects eviction, checkpoint, and cursor operations on disaggregated storage.