- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Cache and Eviction
- Storage Engines - Transactions
- 197.19
Summary
__wt_ref_addr_copy (src/include/btree_inline.h:1830) does a relaxed atomic load of ref->home and then passes the result to __wt_off_page, which dereferences page->dsk unconditionally. Under contention with a deepening parent split, the relaxed load can observe a transient state in which ref->home is briefly NULL on a leaf. __wt_ref_is_root observes the same transient, so one race produces two distinct failure modes.
Two observed failure modes
1. SIGSEGV in __wt_off_page
#0 __wt_off_page (page=0x0, p=0x134f3fe9c280) src/include/btree_inline.h:1838
#1 __wt_ref_addr_copy src/include/btree_inline.h:1838
#2 __evict_review_obsolete_time_window src/evict/evict_page.c:958
#3 __evict_review src/evict/evict_page.c:1029
#4 __wt_evict src/evict/evict_page.c:409
#5 __wti_evict_page (is_server=false) src/evict/evict_dispatch.c:254
#6 __wti_evict_lru_pages src/evict/evict_queue.c:140
#7 __evict_thread_run src/evict/evict_thread.c:117
Seen in test-prepare-hs03-hook-timestamp on ubuntu2004-nonstandalone.
2. ASSERT_ALWAYS supd == NULL in __rec_root_write
__reconcile (src/reconcile/rec_write.c:406-408) calls __rec_root_write when __wt_ref_is_root(ref) returns true. __wt_ref_is_root also relaxed-loads ref->home; on a leaf with a transiently NULL ref->home it returns true, and the root-only assertion fires on the leaf's legitimate mod_multi:
btree=file:test_hs09.wt_stable flags=0x4000 page_type=7 (WT_PAGE_ROW_LEAF) page_is_internal=0 rec_result=2 (WT_PM_REC_MULTIBLOCK) multi_entries=1 i=0 supd=0xffffb6952400 supd_entries=27 multi_flags=0x2 (WT_MULTI_SUPD_RESTORE) disk_image=(nil) addr.block_cookie=(nil)
Captured by an instrumentation patch (6a139e8fbf629c0007cfa8e8). Seen in unit-test-hook-disagg-leader-tsan-bucket01 at ~40% rate (2 of 5 reruns).
Read code
__wt_ref_addr_copy:
    page = (WT_PAGE *)__wt_atomic_load_ptr_relaxed(&ref->home);
    addr = (WT_ADDR *)__wt_atomic_load_ptr_acquire(&ref->addr);
    if (addr == NULL)
        return (false);
    if (__wt_off_page(page, addr)) { /* crashes if page == NULL */
        ...
    }
__wt_ref_is_root (src/include/ref_inline.h:16):
    static WT_INLINE bool
    __wt_ref_is_root(WT_REF *ref)
    {
        return (__wt_tsan_suppress_load_wt_page_ptr_v(&ref->home) == NULL);
    }
The comment immediately above __wt_ref_addr_copy's ref->home load warns about exactly this kind of race:
To look at an on-page cell, we need to look at the parent page's disk image, and that can be dangerous. The problem is if the parent page splits, deepening the tree. As part of that process, the WT_REF WT_ADDRs pointing into the parent's disk image are copied into off-page WT_ADDRs and swapped into place before ref->home is updated to the new child page. Read ref->home before ref->addr with an acquire barrier in between, pairing with the sequentially consistent CAS on ref->addr during split.
But the load is relaxed, and __wt_off_page doesn't tolerate page == NULL:
    static WT_INLINE bool
    __wt_off_page(WT_PAGE *page, const void *p)
    {
        /* page->dsk → SIGSEGV */
        return (page->dsk == NULL || p < (void *)page->dsk ||
          p >= (void *)((uint8_t *)page->dsk + page->dsk->mem_size));
    }
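The ordering that the quoted comment describes can be sketched with C11 atomics. This is a minimal illustration, not WiredTiger code: `sketch_page`, `sketch_ref`, and every `sketch_*` function are hypothetical stand-ins, the release store on `home` is an assumption, and the sketch deliberately assumes `home` is never NULL, which is exactly the assumption this ticket shows can fail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for WT_PAGE and WT_REF; not the real layouts. */
struct sketch_page {
    uint8_t image[64]; /* stands in for the page's disk image */
};

struct sketch_ref {
    _Atomic(struct sketch_page *) home; /* parent page holding this ref */
    _Atomic(void *) addr;               /* on-page cell or off-page copy */
};

/* True if p points inside page's image. Assumes page != NULL. */
static bool
sketch_on_page(const struct sketch_page *page, const void *p)
{
    uintptr_t start = (uintptr_t)page->image;
    uintptr_t v = (uintptr_t)p;

    return (v >= start && v < start + sizeof(page->image));
}

/* Split, step 1: swap the on-page address for an off-page copy (seq-cst CAS). */
static bool
sketch_split_publish_addr(struct sketch_ref *ref, void *off_page_addr)
{
    void *expected = atomic_load_explicit(&ref->addr, memory_order_relaxed);

    return (atomic_compare_exchange_strong(&ref->addr, &expected, off_page_addr));
}

/* Split, step 2: only after step 1, move the ref to its new home. */
static void
sketch_split_move_home(struct sketch_ref *ref, struct sketch_page *new_home)
{
    /* Release ordering here is an assumption of this sketch. */
    atomic_store_explicit(&ref->home, new_home, memory_order_release);
}

/*
 * Reader: load home first, then an acquire fence, then addr. Returns true
 * when addr points into the image of the home that was read.
 */
static bool
sketch_read(struct sketch_ref *ref, struct sketch_page **homep, void **addrp)
{
    *homep = atomic_load_explicit(&ref->home, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    *addrp = atomic_load_explicit(&ref->addr, memory_order_acquire);

    return (*addrp != NULL && sketch_on_page(*homep, *addrp));
}
```

With this ordering a reader that still sees the old home gets either the old on-page address (valid in that home) or the off-page copy, and a reader that sees the new home gets the off-page copy. None of these cases covers a NULL home, so the protocol alone does not protect `__wt_off_page` from the transient described above.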
Caller already filters internal pages
__evict_review_obsolete_time_window at src/evict/evict_page.c:925-927 guarantees ref->page is a leaf:
    WT_ASSERT(session, ref->page != NULL);
    if (WT_PAGE_IS_INTERNAL(ref->page))
        return (0);
So this is not the btree->root ref (whose home is permanently NULL). It's a leaf whose home is NULL for a brief window during the parent's deepening split.
Suggested fix
Either:
1. Change the relaxed load of ref->home (in __wt_ref_addr_copy and __wt_ref_is_root) to an acquire load, and NULL-check the resulting page before calling __wt_off_page. The early addr == NULL return already exists; symmetric handling for the home pointer is straightforward. __wt_ref_is_root should also fail closed on a transient NULL (treat the ref as not-root and let the parent-walk logic re-resolve).
2. Add the NULL guard inside __wt_off_page itself, returning true (treat as off-page) when page == NULL, and audit __wt_ref_is_root callers for the transient-NULL hazard.
Option 1 is more surgical and preserves the protocol described in the existing comment.
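Both options above might look roughly like the following. This is a hedged sketch on stand-in types, not a patch: `fix_dsk`, `fix_image`, `fix_page`, `fix_ref`, and the `fix_*` functions are all hypothetical names, and returning false on a NULL home (mirroring the existing addr == NULL return) is an assumption about what callers can tolerate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real WT_PAGE_HEADER, WT_PAGE or WT_REF. */
struct fix_dsk {
    uint32_t mem_size; /* total size of the disk image in bytes */
};

struct fix_image {
    struct fix_dsk hdr; /* header sits at the start of the image */
    uint8_t cells[60];  /* on-page cells follow the header */
};

struct fix_page {
    struct fix_dsk *dsk;
};

struct fix_ref {
    _Atomic(struct fix_page *) home;
    _Atomic(void *) addr;
};

/* Original-style test: the caller must guarantee page != NULL. */
static bool
fix_off_page_strict(const struct fix_page *page, const void *p)
{
    uintptr_t start, v = (uintptr_t)p;

    if (page->dsk == NULL)
        return (true);
    start = (uintptr_t)page->dsk;
    return (v < start || v >= start + page->dsk->mem_size);
}

/* Option 2: the guard lives in the test itself; a NULL page is "off-page". */
static bool
fix_off_page_guarded(const struct fix_page *page, const void *p)
{
    return (page == NULL || fix_off_page_strict(page, p));
}

/*
 * Option 1: acquire-load home, then addr, and return false when either is
 * NULL. On success, *on_pagep reports whether addr points into home's image.
 */
static bool
fix_ref_addr_check(struct fix_ref *ref, bool *on_pagep)
{
    struct fix_page *page;
    void *addr;

    page = atomic_load_explicit(&ref->home, memory_order_acquire);
    addr = atomic_load_explicit(&ref->addr, memory_order_acquire);
    if (addr == NULL || page == NULL)
        return (false); /* assumption: callers treat this like "no address" */

    *on_pagep = !fix_off_page_strict(page, addr);
    return (true);
}
```

The difference between the two is where the transient is absorbed: option 1 surfaces it to the caller as "no usable address", while option 2 hides it and continues as if the address were an off-page copy, which is only safe if addr really has been swapped to an off-page WT_ADDR by that point.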
Reproduction
Both symptoms surfaced in CI when the eviction pipeline pushed pages faster than the develop baseline. Develop's most recent run at commit beed624757 passes the same tasks, so the race is dormant under normal eviction pacing.
Related
- WT-17634 - sibling latent eviction-time race (__rec_root_write NULL btree->ckpt). Surfaces in a similar way: eviction-time reconcile racing with structural state transitions. Passes on develop but fires when eviction throughput rises.
Investigation context
The race surfaced while testing the WT-17236 dirty-index ring, which queues dirty leaves immediately on modify rather than via the walker's sampling cadence. The drain is not the cause - it accelerates exposure. WT-17236 currently masks both symptoms with a defensive gate that filters in-memory and disagg btrees; once this is fixed, that workaround can be removed.
- is related to: WT-17634 SIGSEGV in __wt_checkpoint_tree_reconcile_update reached from eviction-time reconcile
- Needs Scheduling