Type: Task
Resolution: Unresolved
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Storage Engines - Persistence
Sprint: SE Persistence backlog
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Background

The victim block cache is a sharded LRU cache (ConcurrentSizedLRUCache in src/mongo/db/modules/atlas/src/disagg_storage/pali/block_cache/sized_lru_cache.h). The cache is split into disaggBlockCacheNumShards independent SizedLRUCache instances, each guarded by its own stdx::mutex (via synchronized_value). Every operation selects a shard with shardIndex = hash(key) % nShards and locks only that shard.
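The shard-selection scheme described above can be sketched as follows. This is a minimal illustration, not the actual ConcurrentSizedLRUCache: `ShardedCacheSketch` and `Shard` are hypothetical names, `std::hash` stands in for whatever hash the real cache uses, a plain `std::mutex` stands in for `synchronized_value`, and LRU ordering and size accounting are omitted.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one SizedLRUCache instance plus its mutex.
struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, std::string> entries;  // LRU ordering omitted
};

class ShardedCacheSketch {
public:
    explicit ShardedCacheSketch(std::size_t nShards) : _shards(nShards) {}

    // shardIndex = hash(key) % nShards, as in the ticket description.
    std::size_t shardIndex(const std::string& key) const {
        return std::hash<std::string>{}(key) % _shards.size();
    }

    void add(const std::string& key, std::string value) {
        Shard& shard = _shards[shardIndex(key)];
        std::lock_guard<std::mutex> lk(shard.mutex);  // locks only this shard
        shard.entries[key] = std::move(value);
    }

    bool hasKey(const std::string& key) {
        Shard& shard = _shards[shardIndex(key)];
        std::lock_guard<std::mutex> lk(shard.mutex);
        return shard.entries.count(key) != 0;
    }

private:
    std::vector<Shard> _shards;
};
```

Two threads contend only when their keys hash to the same shard, which is why both the distribution of keys over shards and the time spent waiting on each mutex matter.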

Today the cache exposes only aggregate statistics (see src/mongo/db/modules/atlas/src/disagg_storage/pali/block_cache/pali_block_cache.cpp):

- serverStatus.blockCache.{maxsize,size,count}
- requests.{total,cached,noncached} (hit rate)

There are no per-shard or contention metrics. The shard mutex is a plain std::mutex, which carries no built-in instrumentation.

Motivation

We are tuning disaggBlockCacheNumShards (sys-perf tpcc_majority sweeps on disagg-m8g-perf-11-node.arm.aws). The aggregate hit rate tells us whether the cache holds useful pages, but it cannot tell us:

- whether load is skewed: is one shard receiving disproportionately more requests than the others?
- whether that skew actually causes lock contention (threads blocking to acquire a shard mutex).

Without these signals we cannot confidently decide whether increasing the shard count will help. This ticket is to assess the feasibility and design of such metrics.

Key distinction

Request rate per shard is not the same as contention, and they imply different fixes:

- Load skew: a shard receives more requests. Adding shards helps only if load is roughly uniform but high-volume (spreads it across more locks). If skew comes from a few hot keys, those keys still map to one shard regardless of shard count, so more shards will not help.
- Lock contention: threads actually block on a shard mutex. This is what degrades throughput. The truest signal is lock wait time, not request count: a hot shard can be uncontended if its critical sections are short (here, the memcpy happens outside the lock).

Actionable rule: more shards helps only when there is measurable lock wait time and load is reasonably uniform across shards.
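The hot-key case can be illustrated with a small simulation. This is a sketch under assumptions: `busiestShardShare` and `hotKeyWorkload` are hypothetical helpers, `std::hash` stands in for the cache's real hash function, and the workload is synthetic (half of all requests target a single key).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Fraction of all requests that land on the busiest shard.
double busiestShardShare(const std::vector<std::string>& requests, std::size_t nShards) {
    if (requests.empty() || nShards == 0) {
        return 0.0;
    }
    std::vector<std::size_t> perShard(nShards, 0);
    for (const auto& key : requests) {
        ++perShard[std::hash<std::string>{}(key) % nShards];
    }
    const std::size_t busiest = *std::max_element(perShard.begin(), perShard.end());
    return static_cast<double>(busiest) / static_cast<double>(requests.size());
}

// Synthetic workload: every other request hits the key "hot";
// the remaining requests each use a distinct "cold-<i>" key.
std::vector<std::string> hotKeyWorkload(std::size_t n) {
    std::vector<std::string> requests;
    requests.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        requests.push_back(i % 2 == 0 ? std::string("hot") : "cold-" + std::to_string(i));
    }
    return requests;
}
```

Because every request for the hot key hashes to the same shard, the busiest shard's share stays at or above one half whether the workload is spread over 4 shards or 64. Only the cold traffic is diluted by a higher shard count.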

Possible approaches

1. Profile lock waits (no code change). Since each shard is a std::mutex (futex on Linux), off-CPU / wait profiling (perf record -e sched:sched_switch, perf lock, bpftrace on the futex syscall, or mutrace) reveals time blocked under ConcurrentSizedLRUCache::add / getErase. Zero rebuild; good first triage. Downside: not available in production FTDC, requires a repro under a profiler.
2. Per-shard request counters. Add a std::vector<Atomic<uint64_t>> parallel to _shards, increment at the four call sites (add, getErase, erase, hasKey), and emit via the blockCache serverStatus section. Compute skew (coefficient of variation, or max/mean) to distinguish uniform load (more shards helps) from a single hot shard. Optionally also emit per-shard distinct-key counts to separate "many warm keys" from "one hot key".
3. Lock wait-time metric (gold signal). Time the lock acquisition: call try_lock() first and, only if it fails, take the blocking lock and accumulate the time spent waiting. Expose aggregate (optionally per-shard) lockWaitMicros and lockContendedAcquires. Wait time that rises as concurrency increases is exactly the condition under which more shards pays off, which makes this signal strictly more informative than request counts.
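The per-shard counters and the skew computation could look roughly like this. It is a sketch, not the proposed patch: `ShardRequestCounters`, `Skew`, and `computeSkew` are hypothetical names, and `std::atomic` stands in for the server's `Atomic<uint64_t>` type.

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One relaxed counter per shard, kept parallel to the shard vector.
class ShardRequestCounters {
public:
    explicit ShardRequestCounters(std::size_t nShards) : _counts(nShards) {}

    // Intended to be called outside the shard lock at each call site.
    void record(std::size_t shardIndex) {
        _counts[shardIndex].fetch_add(1, std::memory_order_relaxed);
    }

    std::vector<std::uint64_t> snapshot() const {
        std::vector<std::uint64_t> out;
        out.reserve(_counts.size());
        for (const auto& c : _counts) {
            out.push_back(c.load(std::memory_order_relaxed));
        }
        return out;
    }

private:
    std::vector<std::atomic<std::uint64_t>> _counts;
};

struct Skew {
    double maxOverMean;             // 1.0 means perfectly uniform
    double coefficientOfVariation;  // population stddev / mean; 0.0 means uniform
};

Skew computeSkew(const std::vector<std::uint64_t>& counts) {
    if (counts.empty()) {
        return {0.0, 0.0};
    }
    double sum = 0.0;
    for (auto c : counts) {
        sum += static_cast<double>(c);
    }
    const double n = static_cast<double>(counts.size());
    const double mean = sum / n;
    if (mean == 0.0) {
        return {0.0, 0.0};  // no requests recorded yet
    }
    double squaredDeviations = 0.0;
    for (auto c : counts) {
        const double d = static_cast<double>(c) - mean;
        squaredDeviations += d * d;
    }
    const double stddev = std::sqrt(squaredDeviations / n);
    const double maxCount =
        static_cast<double>(*std::max_element(counts.begin(), counts.end()));
    return {maxCount / mean, stddev / mean};
}
```

For four shards with counts {10, 10, 10, 10} both metrics report uniform load (max/mean of 1, coefficient of variation of 0), while {40, 0, 0, 0} yields a max/mean of 4, the signature of a single hot shard.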

Recommended: implement approaches 2 and 3 so both the skew and the contention signals come straight out of FTDC; use approach 1 as a zero-cost first check. Care should be taken to keep the added atomics off the hot path's critical section (relaxed atomics, incremented outside the lock where possible).

Split from: WT-17801 Identify statistics to track block cache contention (Closed)

Assignee: Unassigned
Reporter: Etienne Petrel
Participants: Etienne Petrel
Votes: 0
Watchers: 1

Created: Jun 18 2026 03:36:27 AM UTC
Updated: Jul 09 2026 09:39:20 PM UTC
