- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Checkpoints, Reconciliation
- None
- Storage Engines - Persistence
- SE Persistence - 2026-02-13, SE Persistence - 2026-02-27
- None
In disaggregated storage, a page rewritten with the same page ID is not discarded on write; instead, the disagg backend knows that the previous versions eventually become redundant. This "skip discard" optimization means we do not correctly track the bytes in use in a btree. Page IDs are only reused for internal and leaf pages, and only if the page does not split: a 1:1 page replacement is needed for a page to be rewritten with the same page ID.
In the reproducer we wrote, this shows up as the b-tree "bytes in use" metric increasing continually across checkpoints where we intentionally trigger 1:1 page replacements.
The logic that determines whether a page is discarded in this case appears to live in __rec_write_wrapup, in the handling of the WT_PM_REC_REPLACE rec result. The issue is in the false branch of the statement if (mod->mod_replace.block_cookie == NULL):
    /*
     * Free the disaggregated block if reconciliation results in zero pages, multiple
     * pages, or a single empty page.
     */
    if (page->disagg_info == NULL)
        WT_RET(__wt_btree_block_free(
          session, mod->mod_replace.block_cookie, mod->mod_replace.block_cookie_size));
    else if (disagg_page_free_required) {
        WT_RET(__wt_btree_block_free(
          session, mod->mod_replace.block_cookie, mod->mod_replace.block_cookie_size));
        page->disagg_info->block_meta.page_id = WT_BLOCK_INVALID_PAGE_ID;
    }
Essentially, in the failure case page->disagg_info is not NULL and disagg_page_free_required is false, so __wt_btree_block_free is never called and the bytes of the replaced block are never subtracted.
In the reproducer left in the comments the following behaviour is seen:
Write leaf page #1 - Add 67 bytes
Write root page #1 - Add 61 bytes
Write leaf page #2 - Add 67 bytes
Write root page #2 - Add 67 bytes
Discard root page #1 - Remove 61 bytes
Write leaf page #3 - Add 67 bytes
Write root page #2 - Add 67 bytes
Discard root page #2 - Remove 61 bytes
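To make the growth concrete, here is a minimal C sketch that tallies the trace above per page kind. The type and function names (page_kind, trace_event, net_bytes_for_kind) are hypothetical and exist only for this illustration; they are not WiredTiger identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page classification, used only for this tally. */
enum page_kind { PAGE_LEAF, PAGE_ROOT };

/* One trace line: a positive delta is a write, a negative delta is a discard. */
struct trace_event {
    enum page_kind kind;
    int64_t delta;
};

/* The events copied, in order, from the reproducer output above. */
static const struct trace_event reproducer_trace[] = {
    {PAGE_LEAF, 67},  /* Write leaf page #1 */
    {PAGE_ROOT, 61},  /* Write root page #1 */
    {PAGE_LEAF, 67},  /* Write leaf page #2 */
    {PAGE_ROOT, 67},  /* Write root page #2 */
    {PAGE_ROOT, -61}, /* Discard root page #1 */
    {PAGE_LEAF, 67},  /* Write leaf page #3 */
    {PAGE_ROOT, 67},  /* Write root page #2 */
    {PAGE_ROOT, -61}, /* Discard root page #2 */
};

#define REPRODUCER_TRACE_LEN (sizeof(reproducer_trace) / sizeof(reproducer_trace[0]))

/* Sum the byte deltas recorded for one kind of page. */
static int64_t
net_bytes_for_kind(const struct trace_event *events, size_t count, enum page_kind kind)
{
    int64_t total = 0;

    for (size_t i = 0; i < count; i++)
        if (events[i].kind == kind)
            total += events[i].delta;
    return (total);
}
```

For this trace the leaf total is 201 bytes (three writes and no discards), whereas every root write after the first is paired with a discard. The leaf rewrites are the 1:1 replacements that never reach the free path.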
Proposed idea: Introduce something along the lines of the following into the above code block:
    else {
        /* Crack the cookie here and decrement the checkpoint size with the cookie size. */
    }
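The effect of the proposed branch can be sketched with a toy model in C. Everything here is an assumption made for illustration: toy_btree, toy_page, toy_block_free, toy_replace_wrapup and toy_rewrite_leaf are hypothetical names, and the model stores the previous block size directly instead of cracking a real address cookie.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the b-tree's bytes-in-use accounting. */
struct toy_btree {
    int64_t bytes_in_use;
};

/* Toy page: remembers the size of the block it was last written as. */
struct toy_page {
    bool has_disagg_info;
    size_t prev_block_size;
};

/* Models a block free that also decrements the accounting. */
static void
toy_block_free(struct toy_btree *btree, size_t size)
{
    btree->bytes_in_use -= (int64_t)size;
}

/* Mirrors the shape of the three-way branch, plus the proposed else. */
static void
toy_replace_wrapup(struct toy_btree *btree, const struct toy_page *page,
    bool disagg_page_free_required, bool apply_proposed_fix)
{
    if (!page->has_disagg_info)
        toy_block_free(btree, page->prev_block_size);
    else if (disagg_page_free_required)
        toy_block_free(btree, page->prev_block_size);
    else if (apply_proposed_fix)
        /* Skip the backend free, but still account for the replaced block. */
        btree->bytes_in_use -= (int64_t)page->prev_block_size;
}

/* Rewrite one 67-byte leaf 1:1 several times and return the tracked bytes. */
static int64_t
toy_rewrite_leaf(int rewrites, bool apply_proposed_fix)
{
    struct toy_btree btree = {0};
    struct toy_page page = {true, 0};

    for (int i = 0; i < rewrites; i++) {
        if (page.prev_block_size != 0)
            toy_replace_wrapup(&btree, &page, false, apply_proposed_fix);
        btree.bytes_in_use += 67;
        page.prev_block_size = 67;
    }
    return (btree.bytes_in_use);
}
```

In this model, toy_rewrite_leaf(3, false) returns 201, matching the leaf growth in the trace, while toy_rewrite_leaf(3, true) returns 67, the size of the single live leaf. The real change would need to obtain the block size from the cookie, which this sketch does not attempt.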
Following on from this we have a number of concerns:
- We had assumed that all block discards call into the block manager; for disagg this is not true.
- The block manager calls back into the b-tree layer to increment and decrement bytes; we should try to lift this functionality into btree_inline.h.
- How can we catch reconciliation oddities in the future, or fix bugs in the field?
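One possible direction for the last concern, offered only as a sketch: a diagnostic check that recomputes the expected bytes in use from the set of live pages and compares it with the tracked counter. The function name and its inputs are hypothetical; how the live page sizes would be gathered in practice is left open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical diagnostic: return true if the tracked counter equals the sum
 * of the sizes of the pages that are currently live.
 */
static bool
toy_bytes_in_use_consistent(
    const size_t *live_page_sizes, size_t live_count, int64_t tracked_bytes_in_use)
{
    int64_t expected = 0;

    for (size_t i = 0; i < live_count; i++)
        expected += (int64_t)live_page_sizes[i];
    return (expected == tracked_bytes_in_use);
}
```

A check of this shape, run after a checkpoint in diagnostic builds or tests, would flag a counter that keeps growing while the set of live pages stays the same size.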