Push-model dirty-page index for eviction candidate discovery

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cache and Eviction
    • Storage Engines, Storage Engines - Transactions
    • 51.034

      Motivation

      On large caches (18 GB+) under high update pressure, the eviction walker is
      the candidate-discovery bottleneck: it samples the btree hoping to find the
      oldest dirty pages before pressure reaches the trigger, and when dirty pages
      are generated quickly it cannot keep up. The WT-17234 investigation
      demonstrated this across eight tuning attempts, including deeper sampling,
      walker persistence, dominance gating, and cache-fill gates: every variant
      that helped one workload class hurt another. FTDC data from the failing
      approaches showed the walker doing 1.83x the work of baseline while
      delivering 32% lower insert throughput.

      This ticket proposes and prototypes a structurally different approach: drive
      dirty-candidate discovery from the write path instead of from walker
      sampling. The walker stops being the throughput-limiting component; it only
      has to pop entries the producer has already identified.

      Design

      Producer (modify path)

      Every successful cursor modify records the dirty leaf ref into a per-btree
      ring. The ring is sized in proportion to the cache: 500 slots per GB,
      clamped between 4096 and 262144. An insert is a trylock plus an atomic head
      advance, so producer contention degrades gracefully: a missed insert is a
      missed performance hint, not a correctness issue. The walker remains a
      safety net.
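
      The sizing rule and the trylock insert described above can be sketched as
      follows. This is a minimal illustration, not the prototype's code: the
      dirty_ring type, its field names, and the dirty_ring_* helpers are
      hypothetical, a void pointer stands in for the leaf ref, and the sketch
      counts whole gigabytes only when sizing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DIRTY_RING_SLOTS_PER_GB 500u
#define DIRTY_RING_MIN_SLOTS 4096u
#define DIRTY_RING_MAX_SLOTS 262144u
#define BYTES_PER_GB (UINT64_C(1) << 30)

/* Hypothetical layout; the ticket does not show the prototype's structure. */
typedef struct {
    atomic_flag lock;             /* producer trylock */
    _Atomic uint64_t head;        /* next insert position */
    size_t slots;                 /* ring capacity */
    void **refs;                  /* stand-in for leaf ref pointers */
    _Atomic uint64_t inserted;    /* successful inserts */
    _Atomic uint64_t contended;   /* inserts dropped because the trylock failed */
    _Atomic uint64_t overwritten; /* inserts that replaced an undrained slot */
} dirty_ring;

/* 500 slots per whole GB of cache, clamped to [4096, 262144]. */
static size_t
dirty_ring_slots(uint64_t cache_bytes)
{
    uint64_t n = (cache_bytes / BYTES_PER_GB) * DIRTY_RING_SLOTS_PER_GB;

    if (n < DIRTY_RING_MIN_SLOTS)
        n = DIRTY_RING_MIN_SLOTS;
    if (n > DIRTY_RING_MAX_SLOTS)
        n = DIRTY_RING_MAX_SLOTS;
    return ((size_t)n);
}

static bool
dirty_ring_init(dirty_ring *ring, uint64_t cache_bytes)
{
    ring->slots = dirty_ring_slots(cache_bytes);
    ring->refs = calloc(ring->slots, sizeof(ring->refs[0]));
    if (ring->refs == NULL)
        return (false);
    atomic_flag_clear(&ring->lock);
    atomic_init(&ring->head, 0);
    atomic_init(&ring->inserted, 0);
    atomic_init(&ring->contended, 0);
    atomic_init(&ring->overwritten, 0);
    return (true);
}

/* Best-effort insert: returns false (and drops the hint) if the lock is busy. */
static bool
dirty_ring_insert(dirty_ring *ring, void *ref)
{
    uint64_t pos;
    size_t idx;

    assert(ring->slots > 0 && ref != NULL);
    if (atomic_flag_test_and_set_explicit(&ring->lock, memory_order_acquire)) {
        atomic_fetch_add_explicit(&ring->contended, 1, memory_order_relaxed);
        return (false);
    }
    pos = atomic_fetch_add_explicit(&ring->head, 1, memory_order_relaxed);
    idx = (size_t)(pos % ring->slots);
    if (ring->refs[idx] != NULL) /* wrapped onto a slot the consumer never drained */
        atomic_fetch_add_explicit(&ring->overwritten, 1, memory_order_relaxed);
    ring->refs[idx] = ref;
    atomic_fetch_add_explicit(&ring->inserted, 1, memory_order_relaxed);
    atomic_flag_clear_explicit(&ring->lock, memory_order_release);
    return (true);
}
```

      Under these assumptions an 18 GB cache gets 9000 slots, and any cache
      below about 8.2 GB gets the 4096-slot minimum. A failed trylock returns
      immediately instead of spinning, which is what keeps the modify path
      cheap under contention.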

      Consumer (eviction walker)

      The walker drains the ring at the start of each per-btree visit. Each ref is
      guarded by a hazard pointer (__wt_hazard_set) before dereferencing
      ref->page, then handed to __evict_try_queue_page -- the same gate
      the tree walker uses. Stale refs (page freed, split in progress, already
      queued, etc.) are skipped. Whatever the drain does not fill, the tree walker
      handles as before.
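
      The drain loop can be illustrated with a small stand-alone model. Everything
      here is a stand-in: mock_ref, mock_hazard_set, and mock_try_queue are
      hypothetical substitutes for the real ref, hazard pointer, and queueing
      gate, and the model deliberately does not capture the split race described
      under "Current prototype status".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ref states; not the engine's real state machine. */
typedef enum { MOCK_REF_MEM, MOCK_REF_SPLIT, MOCK_REF_DELETED } mock_ref_state;

typedef struct {
    mock_ref_state state;
    bool dirty;
    bool queued;
} mock_ref;

typedef struct {
    uint64_t scanned; /* slots examined */
    uint64_t hit;     /* refs successfully queued */
    uint64_t stale;   /* slots filtered as stale */
} drain_stats;

/* Stand-in for taking a hazard pointer: only in-memory refs can be pinned. */
static bool
mock_hazard_set(const mock_ref *ref)
{
    return (ref->state == MOCK_REF_MEM);
}

/* Stand-in for the shared queueing gate: reject clean or already-queued pages. */
static bool
mock_try_queue(mock_ref *ref)
{
    if (!ref->dirty || ref->queued)
        return (false);
    ref->queued = true;
    return (true);
}

/*
 * Drain up to "want" candidates from the ring slots. Returns the number
 * queued; the caller's tree walk is expected to fill whatever is left.
 */
static size_t
dirty_index_drain(mock_ref **slots, size_t nslots, size_t want, drain_stats *st)
{
    size_t i, queued;

    assert(slots != NULL && st != NULL);
    for (i = 0, queued = 0; i < nslots && queued < want; i++) {
        mock_ref *ref = slots[i];

        if (ref == NULL) /* empty slot, nothing to examine */
            continue;
        slots[i] = NULL; /* consume the slot whether or not it is usable */
        st->scanned++;
        if (!mock_hazard_set(ref)) {
            st->stale++;
            continue;
        }
        if (mock_try_queue(ref)) {
            st->hit++;
            queued++;
        } else
            st->stale++;
        /* A real implementation would clear the hazard pointer here. */
    }
    return (queued);
}
```

      Routing every candidate through the same gate as the tree walker means the
      drain does not need its own copy of the eligibility rules; it only decides
      which refs to offer first.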

      Observability (FTDC)

      Six new CacheStat entries for rate and safety metrics:
      * cache_eviction_dirty_index_insert -- producer rate
      * cache_eviction_dirty_index_insert_contended -- trylock drops
      * cache_eviction_dirty_index_overwrite -- ring wrap-around (producer
      faster than consumer)
      * cache_eviction_dirty_index_scanned -- drain slots examined
      * cache_eviction_dirty_index_hit -- refs successfully queued
      * cache_eviction_dirty_index_stale -- slots filtered as stale

      Current prototype status

      The prototype is on the wt-dirty-index branch, rebased directly on
      develop (not piggybacked on the WT-17234 eviction_queue_scale branch
      -- a separate, clean baseline).

      Producer + drain code is fully implemented. Drain is gated behind a
      compile-time WTI_DIRTY_INDEX_DRAIN_ENABLED flag, default 0. Format
      stress surfaced key-order corruption when the drain was enabled
      unconditionally: hazard pointers alone do not coordinate with concurrent
      page splits that mutate key ordering. Getting the consumer safety right
      requires deeper integration with __wt_tree_walk_count semantics
      (follow-up work on this ticket).

      With the drain disabled the producer still fires, letting us measure the
      candidate production rate and producer overhead without risking data
      corruption.

      What this ticket is for

      • Drive the push-model approach to completion on its own branch.
      • Make the drain safe to enable in production (hazard + split-coordination).
      • Measure -- and if positive, ship -- the end-to-end win on YCSB load and
        large-cache / high-dirty workloads where the walker alone cannot keep up.

      Related

      • WT-17234 -- the eight-approach investigation that established the walker
        itself is the bottleneck at scale. This ticket carries the lessons forward
        into a different class of fix.
      • WT-15538 -- umbrella ticket for slow eviction under high update ratio.
      • WT-16529 -- queue usage / empty-queue investigation (pull-side tuning,
        complementary).
      • WT-16665 -- dynamic queue resize (pull-side tuning, complementary).

      Open questions / follow-ups

      • Consumer safety under splits. The current drain races with split
        completion on key order; the drain must either coordinate with the split
        lock or filter split-in-progress refs. This needs investigation; the
        likely answer is to check a ref->home generation or a similar split-safe
        identifier.
      • Whether the ring should also admit dirty-side reads that trigger
        WT_PAGE_EVICT_LRU_URGENT (currently only cursor modify paths feed it).
      • Interaction with disaggregated storage eviction constraints
        (materialization frontier, PALI page-server admission). The drain path
        re-uses __evict_try_queue_page, so those gates should already fire,
        but this needs explicit sys-perf testing.
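
      For the first open question, the generation-check idea could look roughly
      like the sketch below. This is speculative and entirely hypothetical: the
      mock_home, mock_ref, and ring_slot types and the split_gen field are
      invented for illustration, and the sketch does not address how the home
      page itself is kept alive while the check runs, which the real fix would
      also have to solve.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parent page carrying a counter bumped on every split. */
typedef struct {
    _Atomic uint64_t split_gen;
} mock_home;

/* Hypothetical ref that points back at its current parent. */
typedef struct {
    mock_home *home;
} mock_ref;

/* Ring slot that remembers where the ref lived when it was recorded. */
typedef struct {
    mock_ref *ref;
    mock_home *home; /* parent observed at insert time */
    uint64_t gen;    /* parent's split generation at insert time */
} ring_slot;

/* Producer side: capture the ref together with its parent and generation. */
static ring_slot
ring_slot_record(mock_ref *ref)
{
    ring_slot slot;

    assert(ref != NULL && ref->home != NULL);
    slot.ref = ref;
    slot.home = ref->home;
    slot.gen = atomic_load(&ref->home->split_gen);
    return (slot);
}

/*
 * Consumer side: treat the slot as stale if the ref has moved to another
 * parent or the recorded parent has split since the slot was written.
 */
static bool
ring_slot_is_current(const ring_slot *slot)
{
    return (slot->ref->home == slot->home &&
      atomic_load(&slot->home->split_gen) == slot->gen);
}
```

      A check like this errs toward dropping candidates: any split of the parent
      invalidates every slot recorded under it, including refs the split did not
      touch. That is acceptable only because a dropped slot is a missed hint and
      the tree walker still covers the page.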

      Branch

      Primary branch: wt-dirty-index (rebased on develop).

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Haribabu Kommi