SIGSEGV in __wt_checkpoint_tree_reconcile_update reached from eviction-time reconcile

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cache and Eviction
    • Storage Engines - Transactions
    • 208.699

      Summary

      Eviction-time reconciliation crashes with SIGSEGV in __wt_checkpoint_tree_reconcile_update at src/checkpoint/checkpoint_txn.c:2675, reached from __rec_write_wrapup at src/reconcile/rec_write.c:3147 (the r->wrapup_checkpoint != NULL branch). At least three reproductions, across two different tests and via both the app-assist and eviction-server reaching paths, suggest a real race between checkpoint state teardown and concurrent eviction-driven reconcile rather than a per-test flake.

      Failing line

      2674    ckptbase = btree->ckpt;
      2675    WT_CKPT_FOREACH (ckptbase, ckpt)
      

      btree->ckpt is NULL when the WT_CKPT_FOREACH macro starts iterating, so the first access through ckptbase faults. The faulting address offset is consistent across crashes.
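
      To illustrate the failure mode, here is a minimal standalone sketch, not WiredTiger code. It assumes the iteration macro walks the array until it finds an entry with a NULL name; ckpt_t, CKPT_FOREACH and count_checkpoints are hypothetical stand-ins introduced only for this example. A NULL check like this would only mask the suspected race, not fix it.

```c
#include <stddef.h>

/* Hypothetical stand-in for a checkpoint entry: only the field the loop reads. */
typedef struct {
    const char *name; /* A NULL name terminates the array. */
} ckpt_t;

/* Assumed shape of the iteration macro: walk entries until a NULL name. */
#define CKPT_FOREACH(ckptbase, ckpt) \
    for ((ckpt) = (ckptbase); (ckpt)->name != NULL; ++(ckpt))

/* Count entries, treating a NULL base pointer as "no checkpoint list". */
static size_t
count_checkpoints(ckpt_t *ckptbase)
{
    ckpt_t *ckpt;
    size_t n = 0;

    /* Without this check, the loop's first (ckpt)->name read dereferences NULL. */
    if (ckptbase == NULL)
        return (0);

    CKPT_FOREACH (ckptbase, ckpt)
        ++n;
    return (n);
}
```

      With the guard in place, count_checkpoints(NULL) returns 0 instead of faulting; whether silently skipping the update is correct for a root-page write is a separate question for the investigation.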

      Common path

      __wt_checkpoint_tree_reconcile_update   checkpoint_txn.c:2675   (NULL btree->ckpt deref)
      __rec_write_wrapup                      rec_write.c:3147        (r->wrapup_checkpoint != NULL branch -- root-page sync write)
      __reconcile                             rec_write.c:398
      __wt_reconcile                          rec_write.c:127
      __evict_reconcile                       evict_page.c:1284/1287
      __wt_evict                              evict_page.c:450
      __wti_evict_page                        evict_dispatch.c:254
      ... reaching path varies (see below)
      

      The crashing call site is __wt_checkpoint_tree_reconcile_update(session, &r->multi->addr.ta). The page being evicted has wrap-up checkpoint data to write and reconcile assumes btree->ckpt is set, but another thread appears to have torn it down before this thread dereferences it.

      Reproductions

      1. Eviction-server thread (most recent)

      Patch: 6a12c151bf629c0007cf2c68
      Task: wiredtiger_ubuntu2004_nonstandalone_unit_test_bucket04_*_26_05_24_09_13_55
      Test: test_truncate19.test_truncate19.test_truncate19(string_row)
      Build: ubuntu2004-nonstandalone

      Tail of stack:

      #7 __wti_evict_lru_pages    evict_queue.c:140
      #8 __evict_thread_run       evict_thread.c:117
      #9 __thread_run             thread_group.c:32
      

      2. App-thread assist eviction

      Patch: 6a0e5bd16d8e8e00072d3612
      Task: wiredtiger_ubuntu2004_minimal_csuite_tests_fast_*_26_05_21_01_11_51
      Test: test_random_directio (cycle 2/5 child)
      Build: ubuntu2004-minimal csuite

      Tail of stack:

      #7 __wti_evict_app_assist_worker          evict_dispatch.c:385
      #8 __wt_evict_app_assist_worker_check     evict_inline.h:967
      #9 __wt_txn_commit                        txn.c:1862
      #10 __session_commit_transaction          session_api.c:1974
      #11 thread_run                            test/csuite/random_directio/main.c:556
      

      3. Earlier sighting

      A third sighting on an earlier csuite-tests-fast run showed the same __wt_checkpoint_tree_reconcile_update signature on another test. Less detail was captured at the time, but the line number and surrounding frames matched the two above.

      Related but distinct

      WT-17488 (closed): same family of bug (NULL/empty btree->ckpt array access) but fixed in __checkpoint_mark_skip, a different function. The fix is already in develop and does not prevent this crash in __wt_checkpoint_tree_reconcile_update.

      WT-15294: disagg test_prepare20 crash in checkpoint, with a similar checkpoint/reconcile interaction signature but in the disagg path; this one is on plain nonstandalone.

      Reproducibility

      Three sightings across two tests over four days. The two stacks above come from different reaching contexts (eviction-server vs app-assist eviction) but converge on the same common path, from __wti_evict_page down to __wt_checkpoint_tree_reconcile_update. This strongly suggests a genuine race condition, not test-specific flakiness.

      Suggested investigation

      1. Audit the lifetime of btree->ckpt: who allocates, who clears, and what synchronization guards a reconcile path that calls __wt_checkpoint_tree_reconcile_update against concurrent teardown.
      2. In __rec_write_wrapup at line 3147, the wrapup_checkpoint != NULL branch is reached during eviction reconcile of root pages. Verify whether eviction reconcile should ever take this branch when checkpoint state has been torn down on the btree.
      3. Consider whether setting btree->ckpt to NULL after checkpoint completion needs a barrier or interlock with in-flight reconciles.
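
      As a starting point for item 3, the sketch below shows one possible interlock in isolation: a mutex that both the teardown and the reconcile-side reader take around the pointer. It is a standalone illustration under stated assumptions, not a proposed patch. btree_t, ckpt_lock, checkpoint_publish, checkpoint_teardown and reconcile_update are hypothetical names, and the real code may already have locking that should be reused instead.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checkpoint entry; a NULL name terminates the array. */
typedef struct {
    const char *name;
} ckpt_t;

/* Hypothetical btree handle: the lock guards every access to ckpt. */
typedef struct {
    pthread_mutex_t ckpt_lock;
    ckpt_t *ckpt;
} btree_t;

static void
btree_init(btree_t *btree)
{
    pthread_mutex_init(&btree->ckpt_lock, NULL);
    btree->ckpt = NULL;
}

/* Checkpoint side: publish the list for the duration of the checkpoint. */
static void
checkpoint_publish(btree_t *btree, ckpt_t *list)
{
    pthread_mutex_lock(&btree->ckpt_lock);
    btree->ckpt = list;
    pthread_mutex_unlock(&btree->ckpt_lock);
}

/* Checkpoint side: clear the pointer under the lock and hand back the old list. */
static ckpt_t *
checkpoint_teardown(btree_t *btree)
{
    ckpt_t *old;

    pthread_mutex_lock(&btree->ckpt_lock);
    old = btree->ckpt;
    btree->ckpt = NULL;
    pthread_mutex_unlock(&btree->ckpt_lock);
    return (old);
}

/*
 * Reconcile side: walk the list only while holding the lock, so teardown
 * cannot clear the pointer mid-iteration. Returns false if there is no list.
 */
static bool
reconcile_update(btree_t *btree, size_t *countp)
{
    ckpt_t *ckpt;
    bool found = false;

    *countp = 0;
    pthread_mutex_lock(&btree->ckpt_lock);
    if (btree->ckpt != NULL) {
        for (ckpt = btree->ckpt; ckpt->name != NULL; ++ckpt)
            ++*countp;
        found = true;
    }
    pthread_mutex_unlock(&btree->ckpt_lock);
    return (found);
}
```

      The caller still has to decide what a false return means for an eviction-time root-page write (skip, retry, or fail the eviction); the sketch only shows how the pointer read and the teardown could be serialized.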

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Haribabu Kommi