Skip disagg btrees visited by ongoing checkpoint during eviction walk


    • Storage Engines - Transactions
    • 959.13
    • SE Transactions - 2026-05-22, SE Transactions - 2026-06-05
    • 3

      Background

      Spun out of WT-17580. On disaggregated storage (DSC), the cache.evict page failures (eviction worker threads) count is ~400x higher than on ASC. Investigation traced a significant fraction of those worker failures to the post-lock recheck in __wt_page_can_evict (btree_inline.h:2383), which fires cache_eviction_blocked_disagg_next_checkpoint.

      Once checkpoint has visited a disagg btree (btree->checkpoint_gen == WT_GEN_CHECKPOINT), every modified page in that btree stays unevictable for the rest of the global checkpoint cycle. The server keeps queuing those pages, but the worker is guaranteed to fail at the post-lock recheck.

      Proposal

      Add a tree-level skip in __wti_evict_walk (evict_walk.c:410-495) with the predicate:

      disagg_conn && btree->checkpoint_gen == WT_GEN_CHECKPOINT && txn_global.checkpoint_running
      

      Lifecycle confirms this is safe:

      • checkpoint_txn.c:1968 flips checkpoint_running to true before any btree is visited, false after the cycle ends.
      • checkpoint_txn.c:315 (__wt_checkpoint_update_generation) bumps btree->checkpoint_gen when "the tree will not be visited again by the current checkpoint." Until the global cycle ends, every modified page in that btree belongs to the next checkpoint and stays unevictable.

      There is already a sibling skip eviction_server_skip_checkpointing_trees (evict_walk.c:425) that covers the narrow window when WT_BTREE_SYNCING(btree) is true (the btree's own sync phase). Once that flips off mid-checkpoint, the new skip covers the rest of the global checkpoint cycle.
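
      The predicate above, and how it hands over from the existing WT_BTREE_SYNCING skip, can be modelled in a few lines of standalone C. This is a sketch under stated assumptions, not WiredTiger code: the sketch_ structs and functions are hypothetical stand-ins, the generation numbers are arbitrary, and bumping the generation at the same moment the sync phase ends is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the state the predicate reads; not WiredTiger types. */
struct sketch_conn {
    bool disagg;             /* models disagg_conn */
    bool checkpoint_running; /* models txn_global.checkpoint_running */
    uint64_t checkpoint_gen; /* models the value the ticket writes as WT_GEN_CHECKPOINT */
};

struct sketch_btree {
    uint64_t checkpoint_gen; /* models btree->checkpoint_gen */
    bool syncing;            /* models WT_BTREE_SYNCING(btree) */
};

/* Existing sibling skip: covers only the btree's own sync phase. */
static bool
sketch_skip_syncing(const struct sketch_btree *btree)
{
    return (btree->syncing);
}

/* Proposed skip: the tree was already visited and the global cycle is still running. */
static bool
sketch_skip_checkpoint_pending(const struct sketch_conn *conn, const struct sketch_btree *btree)
{
    return (conn->disagg && conn->checkpoint_running &&
      btree->checkpoint_gen == conn->checkpoint_gen);
}

/* Usage: follow one btree through a checkpoint cycle. */
static void
sketch_demo(void)
{
    struct sketch_conn conn = {.disagg = true, .checkpoint_running = true, .checkpoint_gen = 7};
    struct sketch_btree btree = {.checkpoint_gen = 6, .syncing = true};

    /* During the btree's own sync phase only the existing skip applies. */
    assert(sketch_skip_syncing(&btree) && !sketch_skip_checkpoint_pending(&conn, &btree));

    /* Sync finished and the generation was bumped: the proposed skip takes over. */
    btree.syncing = false;
    btree.checkpoint_gen = conn.checkpoint_gen;
    assert(!sketch_skip_syncing(&btree) && sketch_skip_checkpoint_pending(&conn, &btree));

    /* The global cycle ends: the tree is eligible for the walk again. */
    conn.checkpoint_running = false;
    assert(!sketch_skip_checkpoint_pending(&conn, &btree));
}
```

      In this model the two skips never overlap, which matches the intent that the new skip only covers the window the existing one leaves open.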

      Design choices to settle

      • Whole-tree vs dirty-only skip. The btree_inline.h:2377 block only fires for modified pages, so a whole-tree skip would also block legitimate clean-page eviction (relevant for the stable btree, which is read-mostly). Safer to gate on the eviction mode: skip only when the pass targets dirty pages or updates, i.e. F_ISSET(evict, WT_EVICT_CACHE_DIRTY | WT_EVICT_CACHE_UPDATES), and never on a WT_EVICT_CACHE_CLEAN-only pass.
      • Stat naming. Add a new stat (e.g. eviction_server_skip_trees_checkpoint_pending) so cache_eviction_blocked_disagg_next_checkpoint remains meaningful as "this slipped through to post-lock" and the residual is measurable.
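
      One possible shape for the mode gate and the new counter, as a standalone sketch. The SKETCH_ flag values, the sketch_evict struct, and both functions are hypothetical; only the flag and stat names they model come from this ticket. The sketch skips on any pass that has the dirty or updates flag set, and leaves open whether a pass that also targets clean pages should skip the whole tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values modelling the WT_EVICT_CACHE_* eviction mode flags. */
#define SKETCH_EVICT_CACHE_CLEAN 0x1u
#define SKETCH_EVICT_CACHE_DIRTY 0x2u
#define SKETCH_EVICT_CACHE_UPDATES 0x4u

struct sketch_evict {
    uint32_t flags;                         /* eviction mode for the current pass */
    uint64_t skip_trees_checkpoint_pending; /* models the proposed new stat */
};

/* True when the pass is looking for modified pages (dirty or updates). */
static bool
sketch_pass_targets_modified(const struct sketch_evict *evict)
{
    return ((evict->flags & (SKETCH_EVICT_CACHE_DIRTY | SKETCH_EVICT_CACHE_UPDATES)) != 0);
}

/*
 * Decide whether the walk skips a tree. tree_checkpoint_pending is the result of the
 * tree-level predicate; the counter is bumped only when the skip is actually taken.
 */
static bool
sketch_should_skip_tree(struct sketch_evict *evict, bool tree_checkpoint_pending)
{
    if (!tree_checkpoint_pending || !sketch_pass_targets_modified(evict))
        return (false);
    ++evict->skip_trees_checkpoint_pending;
    return (true);
}

/* Usage: a clean-only pass still walks the tree, a dirty pass skips it. */
static void
sketch_mode_demo(void)
{
    struct sketch_evict evict = {.flags = SKETCH_EVICT_CACHE_CLEAN};

    assert(!sketch_should_skip_tree(&evict, true));
    evict.flags = SKETCH_EVICT_CACHE_DIRTY;
    assert(sketch_should_skip_tree(&evict, true));
    assert(evict.skip_trees_checkpoint_pending == 1);
}
```

      Counting at the skip site keeps the two stats disjoint: the new counter records trees skipped up front, while cache_eviction_blocked_disagg_next_checkpoint keeps recording pages that still reach the post-lock recheck.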

      Expected impact

      Reduces eviction_worker_evict_fail on DSC by removing the dominant post-lock failure mode. It should also make the eviction candidate pool more effective, since workers spend lock time on pages that can actually be evicted.

      Validation

      FTDC deltas to compare before/after, broken out per btree:

      • cache_eviction_blocked_disagg_next_checkpoint (expected to drop sharply)
      • eviction_worker_evict_fail (expected to drop)
      • eviction_worker_evict_attempt (expected to rise — fewer failed attempts means more successful work per attempt)
      • New eviction_server_skip_trees_checkpoint_pending (new visibility)

      Related: WT-17580, BF-43333.

            Assignee:
            Chenhao Qu
            Reporter:
            Chenhao Qu