- Type: Task
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Block Cache
- None
- Storage Engines - Persistence
- 371.154
- SE Persistence backlog
- None
Motivation
Before enabling the block cache in production, we need visibility into the lock contention it may introduce. The cache is a sharded ConcurrentSizedLRUCache (src/mongo/db/modules/atlas/src/disagg_storage/pali/block_cache/sized_lru_cache.h): every find/add/erase/getErase acquires the per-shard synchronized_value mutex, and the global Counter64 hit/miss counters in pali_block_cache.cpp are incremented on every get/put. Under high throughput these are the natural contention points, but nothing measures them today.
Approach
- Instrument per-shard mutex wait time and contended-acquire counts in ConcurrentSizedLRUCache (sized_lru_cache.h).
- Surface the new counters in the existing blockCache ServerStatus section (pali_block_cache.cpp, BlockCacheServerStatus) so they flow into FTDC.
- Keep the instrumentation cheap (e.g. relaxed atomics / try-lock fast path) so it does not itself become a bottleneck.
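The try-lock fast path mentioned above could look roughly like the following. This is a minimal sketch in standard C++, not the actual ConcurrentSizedLRUCache or synchronized_value code: `InstrumentedMutex`, `contendedAcquires`, `waitNanos` and `demoContendedAcquire` are hypothetical names introduced only for illustration. The idea is that an uncontended acquire pays for a single `try_lock`, while the clock reads and relaxed counter updates happen only when the lock is already held.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical wrapper (assumption): a plain std::mutex plus two relaxed
// counters, standing in for the per-shard mutex of the real cache.
class InstrumentedMutex {
public:
    // Returns an owning lock; the mutex is released when the lock is destroyed.
    std::unique_lock<std::mutex> lock() {
        std::unique_lock<std::mutex> lk(_mutex, std::try_to_lock);
        if (lk.owns_lock()) {
            return lk;  // Fast path: no clock reads, no counter writes.
        }
        // Slow path: the mutex was held (or try_lock failed spuriously, which
        // the standard permits), so time the blocking acquire.
        const auto start = std::chrono::steady_clock::now();
        lk.lock();
        const auto waited = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start);
        _contendedAcquires.fetch_add(1, std::memory_order_relaxed);
        _waitNanos.fetch_add(static_cast<std::uint64_t>(waited.count()),
                             std::memory_order_relaxed);
        return lk;
    }

    std::uint64_t contendedAcquires() const {
        return _contendedAcquires.load(std::memory_order_relaxed);
    }
    std::uint64_t waitNanos() const {
        return _waitNanos.load(std::memory_order_relaxed);
    }

private:
    std::mutex _mutex;
    std::atomic<std::uint64_t> _contendedAcquires{0};
    std::atomic<std::uint64_t> _waitNanos{0};
};

// Demo helper: the calling thread holds the mutex while a second thread tries
// to acquire it, so the second acquire is expected to take the slow path.
std::uint64_t demoContendedAcquire(InstrumentedMutex& m) {
    std::atomic<bool> started{false};
    auto held = m.lock();
    std::thread waiter([&] {
        started.store(true, std::memory_order_release);
        auto guard = m.lock();  // Blocks until the holder releases the mutex.
    });
    while (!started.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // Give the waiter time to reach the blocking acquire before releasing.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    held.unlock();
    waiter.join();
    return m.contendedAcquires();
}
```

Because `try_lock` may fail spuriously, the contended count is an upper bound rather than an exact figure, which is usually acceptable for a diagnostic stat. Relaxed ordering is enough here because the counters are only read for reporting and do not synchronize any other data.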
Definition of Done
- New contention stat(s) appear under db.serverStatus().blockCache and in FTDC.
- Stats are validated by a unit test in pali_test.cpp (or sized_lru_cache_test.cpp) that drives concurrent access and observes the counters move.
- Instrumentation adds negligible overhead when the cache is uncontended.
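For the stats reported under blockCache, the per-shard counters have to be rolled up into a small number of values. One possible shape is sketched below, assuming each shard keeps its own relaxed atomic counters. `ShardContentionStats`, `ContentionSummary` and `summarize` are hypothetical names, not part of the existing code, and the choice of a max-per-shard figure is only a suggestion for spotting a single hot shard that totals would hide.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical per-shard counters (assumption: one instance per cache shard).
struct ShardContentionStats {
    std::atomic<std::uint64_t> contendedAcquires{0};
    std::atomic<std::uint64_t> waitNanos{0};
};

// Hypothetical roll-up that a ServerStatus section could report.
struct ContentionSummary {
    std::uint64_t totalContendedAcquires = 0;
    std::uint64_t totalWaitNanos = 0;
    std::uint64_t maxShardWaitNanos = 0;  // Largest wait time on any one shard.
};

// Sums the counters across shards and tracks the worst shard. The loads are
// relaxed, so the result is a best-effort snapshot, not a consistent cut.
template <std::size_t N>
ContentionSummary summarize(const std::array<ShardContentionStats, N>& shards) {
    ContentionSummary out;
    for (const auto& shard : shards) {
        const auto waited = shard.waitNanos.load(std::memory_order_relaxed);
        out.totalContendedAcquires +=
            shard.contendedAcquires.load(std::memory_order_relaxed);
        out.totalWaitNanos += waited;
        if (waited > out.maxShardWaitNanos) {
            out.maxShardWaitNanos = waited;
        }
    }
    return out;
}
```

A unit test can set or drive the per-shard counters, call `summarize`, and check that the totals and the per-shard maximum move as expected.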
- split to
- SERVER-129300 Assess adding per-shard metrics to the victim block cache to detect shard contention (Open)
- WT-17832 Add victim-cache compression latency histogram stat (Closed)
- WT-17850 Investigate whether application threads should contribute to the disaggregated victim cache during eviction (Closed)
- WT-17854 Add block_cache_put_time_max statistic for worst-case victim cache put latency (Closed)