Disagg secondary: mass dhandle invalidation per checkpoint causes reader stampede on dhandle write lock

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: DHandles
    • Storage Engines - Persistence

      Problem

      On a disaggregated storage secondary, the `dhandle_lock_blocked` (thread-yield / "data handle lock yielded") counter reaches 25.6 million events over a 15-minute mixed-workload run. This lock contention is the dominant source of the 29–38 ms average command latency observed on the secondary, which is well above the ~685 µs average read latency.

      Root Cause

      `_disagg_apply_checkpoint_meta` ([src/conn/conn_layered.c](src/conn/conn_layered.c)) calls `_wti_conn_dhandle_outdated()` for every table on every checkpoint pickup — twice per table (once for the old checkpoint dhandle, once for the live-btree dhandle). This marks all matching dhandles `WT_DHANDLE_OUTDATED` while holding the schema lock.

      After the schema lock is released, the next access by any session finds its cached dhandle marked `OUTDATED` in `_session_find_dhandle` ([src/session/session_dhandle.c:139-143](src/session/session_dhandle.c)), discards it, and attempts to reopen it. Reopening a not-yet-open dhandle requires an exclusive write lock (`_wt_session_lock_dhandle`). Because all ~231 concurrent reader sessions are invalidated at the same time, for every table, they race to acquire the same exclusive write locks. Each failed attempt spins through the yield loop at [src/session/session_dhandle.c:280-285](src/session/session_dhandle.c) and increments `dhandle_lock_blocked`.

      With 30 checkpoints applied in 15 minutes (~1 per 16 s), the stampede recurs on every checkpoint across all tables and all concurrent sessions, accumulating to 25.6M yield events.
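
To put the total in per-checkpoint terms, the following sketch divides the reported counter by the number of checkpoints and by the approximate session count. It assumes the yields are spread evenly across checkpoints and sessions and that the ~231 connections correspond to 231 reader sessions. Neither assumption is confirmed by the FTDC data, so the results are order-of-magnitude estimates only. The function names are illustrative and are not WiredTiger code.

```c
#include <stdio.h>

/* Figures copied from the FTDC evidence in this ticket. */
#define YIELD_EVENTS 25621687.0 /* dhandle_lock_blocked on the secondary */
#define CHECKPOINTS 30.0        /* checkpoints applied from the primary */
#define SESSIONS 231.0          /* assumption: one session per connection */

/* Average yield events attributable to a single checkpoint pickup. */
double
yields_per_checkpoint(void)
{
    return (YIELD_EVENTS / CHECKPOINTS);
}

/* Average yield events per session for a single checkpoint pickup. */
double
yields_per_session_per_checkpoint(void)
{
    return (yields_per_checkpoint() / SESSIONS);
}

/* Print both averages; callable from any test harness or main(). */
void
print_yield_breakdown(void)
{
    printf("yields per checkpoint: %.0f\n", yields_per_checkpoint());
    printf("yields per session per checkpoint: %.0f\n",
      yields_per_session_per_checkpoint());
}
```

Under these assumptions each checkpoint pickup accounts for roughly 854,000 yield events, or about 3,700 per session.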

      Evidence (FTDC, 2026-06-02 disagg mixed-workload run)

      | Metric | Value |
      |---|---|
      | `dhandle_lock_blocked` (secondary) | 25,621,687 events |
      | Checkpoints applied from primary | 30 (~1 every 16 s) |
      | `apply checkpoint metadata most recent time` | 200 ms (last snapshot) |
      | Average command latency (secondary) | 29,606–38,443 µs |
      | Concurrent connections (secondary) | ~231 |
      | `checkpoint lock application thread wait time` | 70.4 s cumulative (separate issue) |

      Impact

      Command latency on the secondary is roughly 43–56× the raw read latency (29,606–38,443 µs against ~685 µs). Every MongoDB driver heartbeat, cursor getMore, and aggregate command is affected. The contention recurs on each checkpoint pickup regardless of read workload size.

      Suggested Direction

      Instead of invalidating all dhandles globally on every checkpoint pickup, consider:

      • Tracking which dhandles actually changed (by checkpoint name diff) and only marking those outdated, rather than marking the live-btree dhandle for every table unconditionally (the `TODO` comment at [conn_layered.c:508-513](src/conn/conn_layered.c) already notes this should be done at step-up/step-down).
      • Coordinating re-opens so that only one thread per dhandle does the open and others wait on a condition rather than spinning.
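
The second suggestion, a single opener with waiting readers, can be sketched with a mutex and a condition variable. This is a minimal standalone model using POSIX threads, not WiredTiger code: `handle_slot`, `slot_acquire`, `expensive_open`, and `run_readers` are hypothetical names, and the real dhandle open path has error handling and lock ordering that this sketch ignores.

```c
#include <assert.h>
#include <pthread.h>

enum slot_state { SLOT_CLOSED, SLOT_OPENING, SLOT_OPEN };

/* Hypothetical stand-in for one data handle that must be reopened. */
struct handle_slot {
    pthread_mutex_t lock;
    pthread_cond_t opened;
    enum slot_state state;
    int open_calls; /* how many threads actually performed the open */
};

/* Placeholder for the costly reopen; runs without the mutex held. */
static void
expensive_open(struct handle_slot *slot)
{
    (void)slot;
}

/* The first caller performs the open; every other caller sleeps until done. */
void
slot_acquire(struct handle_slot *slot)
{
    pthread_mutex_lock(&slot->lock);
    if (slot->state == SLOT_CLOSED) {
        slot->state = SLOT_OPENING; /* claim the open for this thread */
        pthread_mutex_unlock(&slot->lock);
        expensive_open(slot);
        pthread_mutex_lock(&slot->lock);
        slot->open_calls++;
        slot->state = SLOT_OPEN;
        pthread_cond_broadcast(&slot->opened); /* wake all waiters once */
    } else {
        /* Block instead of spinning; the loop handles spurious wakeups. */
        while (slot->state != SLOT_OPEN)
            pthread_cond_wait(&slot->opened, &slot->lock);
    }
    pthread_mutex_unlock(&slot->lock);
}

static void *
reader_main(void *arg)
{
    slot_acquire(arg);
    return (NULL);
}

/* Start nreaders threads against one closed slot; return the open count. */
int
run_readers(int nreaders)
{
    struct handle_slot slot;
    pthread_t tids[64];
    int i, rc;

    assert(nreaders > 0 && nreaders <= 64);
    pthread_mutex_init(&slot.lock, NULL);
    pthread_cond_init(&slot.opened, NULL);
    slot.state = SLOT_CLOSED;
    slot.open_calls = 0;

    for (i = 0; i < nreaders; i++) {
        rc = pthread_create(&tids[i], NULL, reader_main, &slot);
        assert(rc == 0);
    }
    for (i = 0; i < nreaders; i++)
        pthread_join(tids[i], NULL);

    pthread_cond_destroy(&slot.opened);
    pthread_mutex_destroy(&slot.lock);
    return (slot.open_calls);
}
```

Calling `run_readers(16)` returns 1, because only the thread that observes `SLOT_CLOSED` performs the open. The remaining threads sleep on the condition variable, so in this model they would generate no yield events. A production version would also need a failure path that resets the state to closed and wakes the waiters so that one of them can retry.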

            Assignee:
            Peter Macko
            Reporter:
            Chenhao Qu