Assess adding per-shard metrics to the victim block cache to detect shard contention

    • Type: Task
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines - Persistence
    • SE Persistence backlog

      Background

      The victim block cache is a sharded LRU cache (ConcurrentSizedLRUCache in src/mongo/db/modules/atlas/src/disagg_storage/pali/block_cache/sized_lru_cache.h). The cache is split into disaggBlockCacheNumShards independent SizedLRUCache instances, each guarded by its own stdx::mutex (via synchronized_value). Every operation selects a shard with shardIndex = hash(key) % nShards and locks only that shard.
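
      The shard selection described above can be sketched as follows. This is a minimal illustration, not the actual ConcurrentSizedLRUCache: ShardedCacheSketch and Shard are hypothetical names, std::hash stands in for whatever hash the real cache uses, a plain map replaces the size-bounded LRU, and std::lock_guard replaces synchronized_value.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one SizedLRUCache shard. LRU ordering and
// size-based eviction are omitted to keep the locking pattern visible.
struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, std::string> entries;
};

class ShardedCacheSketch {
public:
    explicit ShardedCacheSketch(std::size_t nShards) : _shards(nShards) {}

    // shardIndex = hash(key) % nShards, as in the description above.
    std::size_t shardIndex(const std::string& key) const {
        return std::hash<std::string>{}(key) % _shards.size();
    }

    void add(const std::string& key, std::string value) {
        Shard& shard = _shards[shardIndex(key)];
        std::lock_guard<std::mutex> lk(shard.mutex);  // locks only this shard
        shard.entries[key] = std::move(value);
    }

    // Removes the entry and hands it to the caller; returns false on a miss.
    bool getErase(const std::string& key, std::string* out) {
        Shard& shard = _shards[shardIndex(key)];
        std::lock_guard<std::mutex> lk(shard.mutex);
        auto it = shard.entries.find(key);
        if (it == shard.entries.end()) {
            return false;
        }
        *out = std::move(it->second);
        shard.entries.erase(it);
        return true;
    }

private:
    std::vector<Shard> _shards;
};
```

      Because a key always hashes to the same shard, two threads contend only when their keys land on the same shard index.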

      Today the cache exposes only aggregate statistics (see src/mongo/db/modules/atlas/src/disagg_storage/pali/block_cache/pali_block_cache.cpp):

      • serverStatus.blockCache.{maxsize,size,count}
      • requests.{total,cached,noncached} (hit rate)

      There are no per-shard or contention metrics. The shard mutex (stdx::mutex) is a plain std::mutex, which carries no built-in instrumentation.

      Motivation

      We are tuning disaggBlockCacheNumShards (sys-perf tpcc_majority sweeps on disagg-m8g-perf-11-node.arm.aws). The aggregate hit rate tells us whether the cache holds useful pages, but it cannot tell us:

      • whether load is skewed – is one shard receiving disproportionately more requests than others?
      • whether that skew actually causes lock contention (threads blocking to acquire a shard mutex).

      Without these signals we cannot confidently decide whether increasing the shard count will help. This ticket is to assess the feasibility and design of such metrics.

      Key distinction

      Request rate per shard is not the same as contention, and they imply different fixes:

      • Load skew – one shard receives more requests than the others. Adding shards helps only if load is roughly uniform but high-volume, because the same requests are then spread across more locks. If skew comes from a few hot keys, each hot key still maps to exactly one shard regardless of shard count, so more shards will not help.
      • Lock contention – threads actually block on a shard mutex. This is what degrades throughput. The most direct signal is lock wait time, not request count: a hot shard can be uncontended if its critical sections are short (here, the memcpy happens outside the lock).

      Actionable rule: more shards helps only when there is measurable lock wait time and load is reasonably uniform across shards.
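
      The rule above can be written down as a small check over per-shard request counts and an aggregate lock wait figure, assuming such metrics existed. Everything here is hypothetical: SkewSummary, summarizeSkew, moreShardsLikelyHelps, and especially the threshold parameters, which this ticket does not propose and which would have to come from measurement.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical summary of how evenly requests are spread across shards.
struct SkewSummary {
    double mean;                    // mean requests per shard
    double maxOverMean;             // 1.0 means perfectly uniform
    double coefficientOfVariation;  // population stddev / mean
};

inline SkewSummary summarizeSkew(const std::vector<std::uint64_t>& perShardRequests) {
    SkewSummary s{0.0, 0.0, 0.0};
    if (perShardRequests.empty()) {
        return s;
    }
    const double n = static_cast<double>(perShardRequests.size());
    const double total =
        std::accumulate(perShardRequests.begin(), perShardRequests.end(), 0.0);
    s.mean = total / n;
    if (s.mean == 0.0) {
        return s;  // no traffic: leave the ratios at zero
    }
    double sumSquares = 0.0;
    for (std::uint64_t count : perShardRequests) {
        const double delta = static_cast<double>(count) - s.mean;
        sumSquares += delta * delta;
    }
    s.coefficientOfVariation = std::sqrt(sumSquares / n) / s.mean;
    const std::uint64_t maxCount =
        *std::max_element(perShardRequests.begin(), perShardRequests.end());
    s.maxOverMean = static_cast<double>(maxCount) / s.mean;
    return s;
}

// Encodes the rule: measurable lock wait AND reasonably uniform load.
// maxCv and minWaitMicros are illustrative thresholds, not recommendations.
inline bool moreShardsLikelyHelps(const SkewSummary& skew,
                                  std::uint64_t lockWaitMicros,
                                  double maxCv,
                                  std::uint64_t minWaitMicros) {
    return lockWaitMicros >= minWaitMicros && skew.coefficientOfVariation <= maxCv;
}
```

      With four shards, counts of {100, 100, 100, 100} give a coefficient of variation of 0 and max/mean of 1, while {400, 0, 0, 0} gives max/mean of 4, which is the single-hot-shard case where adding shards is unlikely to help.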

      Possible approaches

      1. Profile lock waits (no code change). Since each shard is a std::mutex (futex on Linux), off-CPU / wait profiling (perf record -e sched:sched_switch, perf lock, bpftrace on the futex syscall, or mutrace) reveals time blocked under ConcurrentSizedLRUCache::add / getErase. An uncontended acquisition stays in user space and never reaches the futex syscall, so futex-level tracing sees only the contended acquisitions we care about. Zero rebuild; good first triage. Downside: not available in production FTDC, requires a repro under a profiler.
      2. Per-shard request counters. Add a std::vector<Atomic<uint64_t>> parallel to _shards, increment at the four call sites (add, getErase, erase, hasKey), and emit via the blockCache serverStatus section. Compute skew (coefficient of variation, or max/mean) to distinguish uniform load (more shards helps) from a single hot shard. Optionally also emit per-shard distinct-key counts to separate "many warm keys" from "one hot key".
      3. Lock wait-time metric (gold signal). Time the lock acquisition: call try_lock() first, and only on failure time the blocking lock() call and accumulate that duration, so the uncontended path pays no clock reads. Expose aggregate (optionally per-shard) lockWaitMicros and lockContendedAcquires. Rising wait time as concurrency increases is exactly the condition under which more shards pays off – a more direct contention signal than request counts, though it does not by itself reveal skew.
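
      Approaches 2 and 3 could be combined in one acquisition helper along these lines. This is a sketch under assumptions, not a proposed patch: ShardStats and lockWithStats are hypothetical names, std::atomic stands in for the server's Atomic<T>, and the real cache locks through synchronized_value rather than a bare std::unique_lock.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical per-shard counters; the metric names follow the ones
// suggested in approach 3.
struct ShardStats {
    std::atomic<std::uint64_t> requests{0};
    std::atomic<std::uint64_t> lockContendedAcquires{0};
    std::atomic<std::uint64_t> lockWaitMicros{0};
};

// Acquires `m` and returns the held lock. The clock is read only when the
// try_lock fast path fails, so uncontended acquisitions pay for one relaxed
// increment and nothing else.
inline std::unique_lock<std::mutex> lockWithStats(std::mutex& m, ShardStats& stats) {
    // Request counter: relaxed, and bumped before the lock is taken so it
    // stays outside the critical section.
    stats.requests.fetch_add(1, std::memory_order_relaxed);

    std::unique_lock<std::mutex> lk(m, std::try_to_lock);
    if (lk.owns_lock()) {
        return lk;
    }

    // Slow path. Note that try_lock is allowed to fail spuriously, so
    // lockContendedAcquires is an upper bound on true contention.
    const auto start = std::chrono::steady_clock::now();
    lk.lock();
    const auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);

    // These two updates run with the lock held, but only on the path that
    // has already blocked.
    stats.lockContendedAcquires.fetch_add(1, std::memory_order_relaxed);
    stats.lockWaitMicros.fetch_add(static_cast<std::uint64_t>(waited.count()),
                                   std::memory_order_relaxed);
    return lk;
}
```

      A call site would replace its plain lock with auto lk = lockWithStats(shardMutex, shardStats); and the serverStatus section would read the counters with relaxed loads. Placing each shard's counters on their own cache line may be worth evaluating if the counters themselves show up in profiles.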

      Recommended: implement approaches 2 and 3 so both the skew and the contention signals come straight out of FTDC; use approach 1 as a first check that needs no code change. Care should be taken to keep the added atomics out of the shard's critical section on the hot path (relaxed atomics, incremented outside the lock where possible).

            Assignee: Unassigned
            Reporter: Etienne Petrel