Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0, 9.0.0-rc0
Affects Version/s: None
Component/s: Cache and Eviction
Labels:
None

Assigned Teams:

Storage Engines, Storage Engines - Persistence
Total Hours with Assigned Team:
777.297
Epic Link:
Use PALI block cache in production
Sprint:
SE Persistence backlog
Story Points:
None

Issue Summary

__evict_page_victim_cache (src/evict/evict_page.c) stores evicted clean leaf pages in the disaggregated victim cache. It is called from __evict_page_clean_update with no thread-type gating, so application threads, not just the eviction server and workers, run this code. We should decide whether it is acceptable for application threads to do this extra work, since they reach eviction precisely when the cache is already under pressure and they are on a latency-sensitive critical path.

Context

Two application-thread paths reach __evict_page_victim_cache, and neither sets WT_EVICT_CALL_CLOSING, so the victim-cache contribution is not skipped:

App-assist eviction under cache pressure: __wt_evict_app_assist_worker_check -> __wti_evict_app_assist_worker -> __wti_evict_page(session, is_server=false) -> __wt_evict -> __evict_page_clean_update -> __evict_page_victim_cache. Triggered from txn commit/rollback, the page-read wait loop, and compaction.
Forced/urgent eviction on page release: __wt_page_release_evict -> __wt_evict(... WT_EVICT_CALL_URGENT) -> same clean-update path.

Once past the early-return gates (disaggregated btree, cache-put available, clean non-root leaf with disk image, hot tier), the thread does the following inline while holding the page locked: full-page compression (__wt_blkcache_compress), a whole-buffer checksum (__wt_checksum), a page-header byteswap, and a plh_cache_put into the page-log layer. The cost of the put is implementation-defined and is potentially the dominant term.
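
The early-return gates listed above can be read as a single eligibility predicate. The following is a minimal sketch under stated assumptions, not the code in evict_page.c: struct victim_candidate, its fields, and victim_cache_eligible are hypothetical names that flatten state the real function would read from the btree, the page, and the page-log handle.

```c
#include <stdbool.h>

/*
 * Hypothetical, flattened view of what the gates inspect. None of these
 * field names come from the WiredTiger source.
 */
struct victim_candidate {
    bool btree_disaggregated; /* page belongs to a disaggregated btree */
    bool cache_put_available; /* page-log layer offers a cache-put hook */
    bool is_clean;            /* page has no unwritten modifications */
    bool is_root;             /* root pages are never cached this way */
    bool is_leaf;             /* only leaf pages are candidates */
    bool has_disk_image;      /* there is an on-disk image to store */
    bool on_hot_tier;         /* cold pages are skipped */
};

/* Return true only when every early-return gate is passed. */
static bool
victim_cache_eligible(const struct victim_candidate *c)
{
    if (!c->btree_disaggregated || !c->cache_put_available)
        return (false);
    if (!c->is_clean || c->is_root || !c->is_leaf || !c->has_disk_image)
        return (false);
    return (c->on_hot_tier);
}
```

Only pages that pass this check pay for the compression, checksum, byteswap, and put described above, so the thread-type question applies to that subset alone.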

Trade-off: adding this work to an application thread extends a user-operation stall that is already happening under cache pressure; but the evicted page is hot and on the hot tier, so caching it on the way out avoids a likely-more-expensive re-fetch from the page-log layer later. There is currently no visibility into how much of the victim-cache fill comes from application threads.

Proposed Solution

Investigate and decide between the two options below. Note these are not mutually exclusive: option 3 can inform option 2.

Option 2: Gate the victim-cache contribution by thread type. Make __evict_page_victim_cache (or its caller) skip the victim-cache put when the evicting thread is an application thread, at least for the interruptible app-assist path where the thread is least committed to the eviction. Forced/urgent eviction (__wt_page_release_evict) could still cache, since that page is definitely leaving the cache. The check would key off is_server / WT_SESSION_INTERNAL. Trade-off to evaluate: under heavy pressure application threads perform a large share of evictions, so gating them out could significantly reduce victim-cache coverage exactly when it matters most.
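
One way to picture the gating in option 2 is as a small policy function. This is a sketch only: victim_cache_put_allowed and its two boolean parameters are hypothetical stand-ins for whatever the real check would derive from is_server / WT_SESSION_INTERNAL and WT_EVICT_CALL_URGENT, and the policy shown (internal threads always, application threads only on forced eviction) is one reading of the proposal rather than a decided design.

```c
#include <stdbool.h>

/*
 * Hypothetical policy: decide whether the evicting thread may contribute the
 * page to the victim cache.
 *
 * session_internal: true for the eviction server and workers (assumed to map
 *                   to an internal-session check in the real code).
 * urgent:           true for forced/urgent eviction on page release.
 */
static bool
victim_cache_put_allowed(bool session_internal, bool urgent)
{
    /* Eviction server and workers keep the current behavior. */
    if (session_internal)
        return (true);

    /*
     * Application thread: skip the put on the interruptible app-assist path,
     * but still cache on forced eviction, where the page is leaving anyway.
     */
    return (urgent);
}
```

Keeping the decision in one predicate would make it straightforward to flip the policy once measurements show how much coverage the app-assist path actually provides.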

Option 3: Add a statistic and measure first. Before changing any policy, add a connection statistic (e.g. block_cache_app_thread_puts) that counts victim-cache puts performed by application threads, so we can quantify what fraction of victim-cache fill is application-driven and what it costs. Use that data to decide whether option 2 (or an async plh_cache_put) is worth pursuing.

Suggested stats:

# Counters — frequency, split by role (alongside block_cache_cold_not_cached)
BlockCacheStat('block_cache_app_thread_puts',
    'pages added to the disaggregated victim cache during eviction by application threads'),
BlockCacheStat('block_cache_eviction_thread_puts',
    'pages added to the disaggregated victim cache during eviction by eviction threads'),

# Cost — time application threads spend on the victim-cache put path
EvictStat('eviction_app_victim_cache_time',
    'time (usecs) application threads spent putting pages into the victim cache', 'no_clear'),
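
As an illustration of how the suggested counters could answer the question in option 3, the sketch below keeps the three values in a plain struct and derives the application-driven share of victim-cache fill. The struct, both functions, and the plain field increments are assumptions for illustration; real statistics would go through WiredTiger's generated statistics machinery rather than direct field updates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the three suggested statistics. */
struct victim_cache_stats {
    uint64_t app_thread_puts;      /* block_cache_app_thread_puts */
    uint64_t eviction_thread_puts; /* block_cache_eviction_thread_puts */
    uint64_t app_put_usecs;        /* eviction_app_victim_cache_time */
};

/* Record one completed victim-cache put, split by the role of the caller. */
static void
victim_cache_record_put(
  struct victim_cache_stats *s, bool app_thread, uint64_t elapsed_usecs)
{
    if (app_thread) {
        ++s->app_thread_puts;
        s->app_put_usecs += elapsed_usecs;
    } else
        ++s->eviction_thread_puts;
}

/* Fraction of victim-cache fill contributed by application threads. */
static double
victim_cache_app_fraction(const struct victim_cache_stats *s)
{
    uint64_t total = s->app_thread_puts + s->eviction_thread_puts;

    return (total == 0 ? 0.0 : (double)s->app_thread_puts / (double)total);
}
```

The fraction indicates how much coverage option 2 would give up, and app_put_usecs divided by app_thread_puts gives the average stall each application-thread put adds.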

Definition of Done

A recommendation documented on this ticket: keep current behavior, gate by thread type (option 2), and/or add measurement (option 3).
If option 3 is chosen as the first step, the stat is added and wired through codegen.
If option 2 is chosen, the gating is implemented with the interruptible vs forced-eviction distinction considered, and the coverage impact is assessed.

has to be done before

WT-17833 Do not let application threads insert pages into the disaggregated victim/block cache

Open

related to

WT-17832 Add victim-cache compression latency histogram stat

Closed

WT-17854 Add block_cache_put_time_max statistic for worst-case victim cache put latency

Closed

split from

WT-17801 Identify statistics to track block cache contention

Closed

Assignee: Etienne Petrel
Reporter: Etienne Petrel
Votes: 0
Watchers: 1

Created: Jun 16 2026 05:45:26 AM UTC
Updated: Jun 21 2026 03:15:08 AM UTC
Resolved: Jun 17 2026 07:13:40 AM UTC
