- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Cache and Eviction
- Storage Engines - Transactions
- 208.699
Summary
Eviction-time reconciliation crashes with SIGSEGV in __wt_checkpoint_tree_reconcile_update at src/checkpoint/checkpoint_txn.c:2675, reached from __rec_write_wrapup at src/reconcile/rec_write.c:3147 (the r->wrapup_checkpoint != NULL branch). At least three reproductions across two different tests, reached through both the app-assist and eviction-server paths, suggest a real race between checkpoint state teardown and concurrent eviction-driven reconcile rather than a per-test flake.
Failing line
2674    ckptbase = btree->ckpt;
2675    WT_CKPT_FOREACH (ckptbase, ckpt)
btree->ckpt is NULL when the WT_CKPT_FOREACH macro starts iterating. The faulting address offset is the same in every crash.
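To make the failure mode concrete, here is a minimal, self-contained sketch of a sentinel-terminated iteration in the style of WT_CKPT_FOREACH. The struct ckpt_entry, the macro CKPT_FOREACH_SKETCH, and the function count_ckpts_guarded are hypothetical stand-ins, not WiredTiger's definitions. The sketch assumes the macro's loop condition reads a field of the current element, which would explain why a NULL base faults at the same small offset every time.

```c
#include <stddef.h>

/* Hypothetical stand-in for a checkpoint array element; not WiredTiger's WT_CKPT. */
struct ckpt_entry {
    long order;       /* placed first so 'name' sits at a non-zero offset */
    const char *name; /* a NULL name marks the end of the array */
};

/* Assumed shape of a sentinel-terminated iteration macro. */
#define CKPT_FOREACH_SKETCH(base, cur) \
    for ((cur) = (base); (cur)->name != NULL; ++(cur))

/*
 * Count entries, returning 0 instead of faulting when the base is NULL.
 * Without the guard, the first loop test would read
 * ((struct ckpt_entry *)NULL)->name, a fixed small address.
 */
size_t
count_ckpts_guarded(struct ckpt_entry *base)
{
    struct ckpt_entry *cur;
    size_t n = 0;

    if (base == NULL)
        return (0);
    CKPT_FOREACH_SKETCH(base, cur)
        ++n;
    return (n);
}
```

A NULL check like this only illustrates where the dereference happens. On its own it would not close a race, because the array could still be freed between the check and the loop.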
Common path
__wt_checkpoint_tree_reconcile_update  checkpoint_txn.c:2675  (NULL btree->ckpt deref)
__rec_write_wrapup  rec_write.c:3147  (r->wrapup_checkpoint != NULL branch -- root-page sync write)
__reconcile  rec_write.c:398
__wt_reconcile  rec_write.c:127
__evict_reconcile  evict_page.c:1284/1287
__wt_evict  evict_page.c:450
__wti_evict_page  evict_dispatch.c:254
...  reaching path varies (see below)
The crashing call site is __wt_checkpoint_tree_reconcile_update(session, &r->multi->addr.ta). The page being evicted has wrap-up checkpoint data to write and the reconcile assumes btree->ckpt is set, but another thread has torn it down before this thread iterates over it.
Reproductions
1. Eviction-server thread (most recent)
Patch: 6a12c151bf629c0007cf2c68
Task: wiredtiger_ubuntu2004_nonstandalone_unit_test_bucket04_*_26_05_24_09_13_55
Test: test_truncate19.test_truncate19.test_truncate19(string_row)
Build: ubuntu2004-nonstandalone
Tail of stack:
#7 __wti_evict_lru_pages evict_queue.c:140
#8 __evict_thread_run evict_thread.c:117
#9 __thread_run thread_group.c:32
2. App-thread assist eviction
Patch: 6a0e5bd16d8e8e00072d3612
Task: wiredtiger_ubuntu2004_minimal_csuite_tests_fast_*_26_05_21_01_11_51
Test: test_random_directio (cycle 2/5 child)
Build: ubuntu2004-minimal csuite
Tail of stack:
#7 __wti_evict_app_assist_worker evict_dispatch.c:385
#8 __wt_evict_app_assist_worker_check evict_inline.h:967
#9 __wt_txn_commit txn.c:1862
#10 __session_commit_transaction session_api.c:1974
#11 thread_run test/csuite/random_directio/main.c:556
3. Earlier sighting
A third sighting on an earlier csuite-tests-fast run showed the same __wt_checkpoint_tree_reconcile_update signature on another test. Less detail captured at the time, but the line number and surrounding frames matched the two above.
Related but distinct
WT-17488 (closed): same family of bug (NULL/empty btree->ckpt array access), but fixed in __checkpoint_mark_skip, a different function. That fix is already in develop and does not prevent this crash in __wt_checkpoint_tree_reconcile_update.
WT-15294: disagg test_prepare20 crash in checkpoint. It has a similar checkpoint/reconcile interaction signature but occurs in the disagg path, whereas this crash is on plain nonstandalone.
Reproducibility
Three sightings across two tests over four days. The two stacks above come from different reaching contexts (app-assist eviction vs eviction-server) but converge on the same final two frames. This strongly suggests a genuine race condition, not test-specific flakiness.
Suggested investigation
- Audit the lifetime of btree->ckpt: who allocates, who clears, and what synchronization guards a reconcile path that calls __wt_checkpoint_tree_reconcile_update against concurrent teardown.
- In __rec_write_wrapup at line 3147, the wrapup_checkpoint != NULL branch is reached during eviction reconcile of root pages. Verify whether eviction reconcile should ever take this branch when checkpoint state has been torn down on the btree.
- Consider whether btree->ckpt = NULL after checkpoint completion needs a barrier or interlock with in-flight reconciles.
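As a starting point for the interlock question, the following is a minimal sketch of one possible pattern: the teardown side detaches the pointer under a lock, and the reconcile side checks and uses it under the same lock. Everything here is hypothetical (struct tree_sketch, its ckpt_lock field, and the three functions) and uses a plain pthread mutex. It does not reflect WiredTiger's actual locking primitives or which lock, if any, should guard btree->ckpt.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of a btree handle; field names are illustrative only. */
struct tree_sketch {
    pthread_mutex_t ckpt_lock; /* assumed lock guarding 'ckpt' */
    int *ckpt;                 /* stand-in for the checkpoint array; NULL once torn down */
};

/* Initialize the handle, taking ownership of a heap-allocated 'ckpt'. */
void
tree_init(struct tree_sketch *t, int *ckpt)
{
    pthread_mutex_init(&t->ckpt_lock, NULL);
    t->ckpt = ckpt;
}

/* Teardown side: detach the pointer under the lock, then free it outside. */
void
ckpt_teardown(struct tree_sketch *t)
{
    int *old;

    pthread_mutex_lock(&t->ckpt_lock);
    old = t->ckpt;
    t->ckpt = NULL;
    pthread_mutex_unlock(&t->ckpt_lock);
    free(old);
}

/*
 * Reconcile side: check and use the pointer while holding the lock, so
 * teardown cannot free it mid-use. Returns false if the state is gone.
 */
bool
reconcile_update(struct tree_sketch *t, int value)
{
    bool applied = false;

    pthread_mutex_lock(&t->ckpt_lock);
    if (t->ckpt != NULL) {
        *t->ckpt = value;
        applied = true;
    }
    pthread_mutex_unlock(&t->ckpt_lock);
    return (applied);
}
```

In this sketch a reconcile that arrives after teardown gets a false return instead of a crash. Whether the real fix should skip the update, retry, or prevent eviction from reaching this branch at all is exactly what the investigation above needs to decide.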