Fix page ID reuse after discard in __rec_split_discard


    • Storage Engines - Transactions
    • SE Transactions - 2026-04-24
    • 1

      Problem

      In disaggregated storage, once a block with a given page_id is discarded in the page log, that page_id is dead. The first write after a discard must have backlink_lsn = 0 (a fresh start). Reusing the page_id with a non-zero backlink_lsn references a chain that was already invalidated, causing verify_chain to fail with "Full page backlink_lsn mismatch", returning EINVAL and panicking the eviction thread.

      Code Flow

      Step 1 – Reconciliation produces MULTIBLOCK with 1 entry, reusing the page_id

      A page with page_id X is dirty and has saved updates that can't be written (e.g., an older reader holds a snapshot). Reconciliation produces a single block but goes through the WT_MULTI_SUPD_RESTORE path (rec_write.c:3103-3110):

      if (F_ISSET(r, WT_REC_IN_MEMORY) || F_ISSET(r->multi, WT_MULTI_SUPD_RESTORE)) {
          if (page->disagg_info != NULL)
              page->disagg_info->block_meta = *r->multi->block_meta;  /* page_id X copied back */
          goto split;  /* sets mod->rec_result = WT_PM_REC_MULTIBLOCK */
      }
      

      During the write phase (__rec_split_write_image), since r->multi_next == 1 and page->disagg_info->block_meta.page_id != WT_BLOCK_INVALID_PAGE_ID, the page_id X is reused. A block with page_id X is written to the page log (e.g., a delta at lsn=545145). The goto split path sets mod->rec_result = WT_PM_REC_MULTIBLOCK with mod->mod_multi_entries = 1.
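
The reuse decision described above can be sketched as follows. This is an illustration under stated assumptions, not the code in __rec_split_write_image: write_meta, pick_write_meta, SKETCH_INVALID_PAGE_ID (a stand-in for WT_BLOCK_INVALID_PAGE_ID whose real value is not assumed here) and the next_free_id counter are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_PAGE_ID 0 /* placeholder for WT_BLOCK_INVALID_PAGE_ID */

/* Hypothetical subset of the block metadata relevant to this ticket. */
typedef struct {
    uint64_t page_id;
    uint64_t backlink_lsn;
} write_meta;

/*
 * Choose the metadata for the next write. A single-block result with a still-valid page_id
 * reuses that page_id and links back to prev_lsn; every other case gets a fresh page_id and
 * backlink_lsn = 0.
 */
write_meta
pick_write_meta(uint32_t multi_next, const write_meta *prev, uint64_t prev_lsn,
  uint64_t *next_free_id)
{
    write_meta out;

    assert(next_free_id != NULL);

    if (multi_next == 1 && prev != NULL && prev->page_id != SKETCH_INVALID_PAGE_ID) {
        out.page_id = prev->page_id; /* reuse page_id X */
        out.backlink_lsn = prev_lsn;
    } else {
        out.page_id = (*next_free_id)++; /* fresh page_id, fresh chain */
        out.backlink_lsn = 0;
    }
    return (out);
}
```

The sketch shows why the in-memory page_id matters: as long as it is not the invalid value, a single-block write continues the old chain.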

      Step 2 – Eviction rewrites the page in memory

      Eviction handles WT_PM_REC_MULTIBLOCK with mod_multi_entries == 1 by calling __wt_split_rewrite (evict_page.c:676-678). Inside __split_multi_inmem (bt_split.c:1496-1497), the multi block's block_meta (page_id X) is copied into the new in-memory page:

      if (page->disagg_info != NULL) {
          page->disagg_info->block_meta = *multi->block_meta;  /* page_id X preserved */
      

      The page is now in memory with page_id X, dirty (has saved updates), and mod->rec_result is effectively 0 on the new page.

      Step 3 – Next reconciliation: __rec_split_discard discards the block but doesn't invalidate the page_id

      The page is reconciled again. In __rec_write_wrapup, the cleanup switch processes the previous mod->rec_result == WT_PM_REC_MULTIBLOCK and calls __rec_split_discard.

      If the new reconciliation does not produce exactly one page (r->multi_next != 1), free_blocks is true and the old block (with page_id X) is discarded via __wt_btree_block_free (rec_write.c:2751-2753).

      This sends a discard to the page log for page_id X. But page->disagg_info->block_meta.page_id is never set to WT_BLOCK_INVALID_PAGE_ID. The page_id X remains "valid" in memory.

      Step 4 – The new reconciliation reuses the discarded page_id

      If the new reconciliation produces a single page, __rec_split_write_image checks page->disagg_info->block_meta.page_id and finds X (still valid). It reuses page_id X with backlink_lsn pointing to the previous full page in the chain. A new full page (lsn=546143) is written with page_id X and backlink_lsn=463197. But page_id X was already discarded in step 3. Palite's verify_chain correctly rejects this.

      Evidence

      Page log state for (table_id=57, page_id=1965):

      lsn=546143, backlink=463197, base=0,      delta=0, discarded=0  <- new full page (reused page_id after discard)
      lsn=545145, backlink=463197, base=463197, delta=1, discarded=1  <- discarded delta (step 3)
      lsn=463197, backlink=461766, base=0,      delta=0, discarded=0  <- valid previous full page
      

      The new full page (lsn=546143) has backlink_lsn=463197, but page_id 1965 had already been discarded (the delta at lsn=545145 is marked discarded). With the fix, the page_id is invalidated in step 3, a fresh page_id is allocated in step 4, and the write has backlink_lsn=0.
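
The three records above can be replayed through a small scanner that flags the offending write. This is a sketch for illustration only: log_record and first_reuse_after_discard are hypothetical names, and the model assumes records for one page_id are visited in ascending LSN order and that a record flagged discarded means the page_id was dead before any later write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the columns shown in the page log dump. */
typedef struct {
    uint64_t lsn;
    uint64_t backlink_lsn;
    uint64_t base_lsn;
    int is_delta;
    int discarded;
} log_record;

/*
 * Scan records for one page_id in ascending LSN order. Return the LSN of the first live write
 * that follows a discarded record but carries a non-zero backlink; return 0 if there is none.
 */
uint64_t
first_reuse_after_discard(const log_record *recs, size_t n)
{
    int seen_discard = 0;

    assert(recs != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        if (recs[i].discarded) {
            seen_discard = 1;
            continue;
        }
        if (seen_discard) {
            if (recs[i].backlink_lsn != 0)
                return (recs[i].lsn); /* links into a chain that was already invalidated */
            seen_discard = 0;         /* backlink 0: a fresh chain has started */
        }
    }
    return (0);
}
```

Fed the three records for page_id=1965, oldest first, the scanner returns 546143. If that record instead carried backlink_lsn=0, as expected with the fix, it returns 0.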

      Fix

      Changes:

      1. Invalidate page_id in __rec_split_discard (src/reconcile/rec_write.c): After discarding a block via __wt_btree_block_free, check whether the discarded block's page_id matches page->disagg_info->block_meta.page_id. If so, invalidate it so the next reconciliation allocates a fresh page_id. This is the root cause fix.

      2. Hoist the disagg_page_free_required computation in __rec_write_wrapup above the switch statement, so it is computed once instead of separately in the case 0 and case WT_PM_REC_REPLACE branches.
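
The invalidation in change 1 can be sketched as below. This is not the actual patch: the fix_* types are simplified stand-ins for the WiredTiger page structures, FIX_INVALID_PAGE_ID is a placeholder for WT_BLOCK_INVALID_PAGE_ID, and invalidate_page_id_after_discard is a hypothetical helper that would run right after the block has been freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIX_INVALID_PAGE_ID 0 /* placeholder for WT_BLOCK_INVALID_PAGE_ID */

/* Simplified stand-ins for the real structures; only the fields named in this ticket. */
typedef struct {
    uint64_t page_id;
} fix_block_meta;

typedef struct {
    fix_block_meta block_meta;
} fix_disagg_info;

typedef struct {
    fix_disagg_info *disagg_info; /* NULL when the page is not on disaggregated storage */
} fix_page;

/*
 * After the block with discarded_page_id has been discarded, forget that page_id in memory if
 * the page still refers to it, so the next reconciliation allocates a fresh one.
 */
void
invalidate_page_id_after_discard(fix_page *page, uint64_t discarded_page_id)
{
    assert(page != NULL);

    if (page->disagg_info != NULL &&
      page->disagg_info->block_meta.page_id == discarded_page_id)
        page->disagg_info->block_meta.page_id = FIX_INVALID_PAGE_ID;
}
```

Comparing against the discarded block's page_id, rather than clearing unconditionally, leaves the in-memory metadata untouched when the discarded block belongs to a different page_id.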

      Impact

      Affects eviction, checkpoint, and cursor operations on disaggregated storage.

            Assignee:
            Haribabu Kommi
            Reporter:
            Haribabu Kommi