Disagg secondary: checkpoint lock serializes all reader threads on each checkpoint pickup

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints, DHandles
    • Storage Engines - Persistence
    • 46.977

      Problem

      On a disaggregated storage secondary, reader threads accumulate 70.4 seconds of total checkpoint lock wait time (`checkpoint lock application thread wait time`) over a 15-minute mixed-workload run. Checkpoints arrive from the primary approximately every 16 seconds, and each checkpoint application blocks every reader thread that needs the checkpoint lock until the metadata has been applied. This contributes directly to the 29–38 ms average command latency on the secondary.

      Root Cause

      When the secondary picks up a new checkpoint from the primary, `_disagg_apply_checkpoint_meta` ([src/conn/conn_layered.c](src/conn/conn_layered.c)) runs under the schema lock and the checkpoint lock. This is necessary to atomically update local metadata and mark affected dhandles outdated. However, the checkpoint lock is held for the full duration of metadata application — iterating through all tables in the checkpoint, updating each table's checkpoint config, and calling `_wti_conn_dhandle_outdated` for each.

      During this window, any reader thread that needs the checkpoint lock (e.g., to verify snapshot state or read timestamps) blocks. With 231 concurrent sessions and checkpoints every ~16 s, the cumulative wait is 70.4 s — averaging ~2.3 s of aggregate reader blocking per checkpoint applied.
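The per-checkpoint average can be reproduced from the figures in the Evidence table. The sketch below is only a back-of-the-envelope check, not code from the tree: the function names are hypothetical, it assumes the 70.4 s accrued over the same window as the 30 applied checkpoints, and the per-session figure additionally assumes the wait is spread evenly across all ~231 sessions, which the ticket does not establish.

```c
#include <assert.h>

/* Figures copied from the Evidence table in this ticket. */
#define CKPT_LOCK_WAIT_US 70443950.0 /* checkpoint lock application thread wait time */
#define CHECKPOINTS_APPLIED 30.0     /* checkpoints applied from primary */
#define CONCURRENT_SESSIONS 231.0    /* concurrent connections on the secondary */

/* Aggregate reader wait attributed to one checkpoint pickup, in seconds. */
static double
wait_per_checkpoint_sec(double total_wait_us, double checkpoints)
{
    assert(checkpoints > 0);
    return (total_wait_us / 1e6 / checkpoints);
}

/*
 * The same wait divided evenly over every session, in milliseconds.
 * Even distribution is an assumption made for illustration only.
 */
static double
wait_per_session_ms(double total_wait_us, double checkpoints, double sessions)
{
    assert(checkpoints > 0 && sessions > 0);
    return (total_wait_us / 1e3 / checkpoints / sessions);
}
```

Calling `wait_per_checkpoint_sec(CKPT_LOCK_WAIT_US, CHECKPOINTS_APPLIED)` gives about 2.35 s, matching the ~2.3 s quoted above. Under the even-spread assumption, `wait_per_session_ms` gives about 10.2 ms per session per checkpoint.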

      Evidence (FTDC, 2026-06-02 disagg mixed-workload run)

      | Metric | Value |
      |---|---|
      | `checkpoint lock application thread wait time` | 70,443,950 µs (70.4 s cumulative) |
      | Checkpoints applied from primary (stable_adv) | 30 over ~8 min (1 per ~16 s) |
      | Average command latency (secondary) | 29,606–38,443 µs |
      | Average read latency (secondary) | 685 µs |
      | Concurrent connections (secondary) | ~231 |
      | `apply checkpoint metadata most recent time` | 200 ms (last checkpoint) |
      | `schema lock application thread wait time` | 93.5 ms cumulative |

      The gap between average read latency (685 µs) and average command latency (29–38 ms) is the clearest signal: commands are being stalled by lock contention, not by I/O or computation.

      Impact

      All reader sessions on the secondary are serialized behind checkpoint application. Because checkpoint pickup is continuous (as long as the primary is writing), there is no steady-state period where readers are unaffected. The effect scales with the number of concurrent sessions and the number of tables in the checkpoint.

      Suggested Direction

      • Reduce the scope of work done under the checkpoint lock during pickup — defer dhandle invalidation or table metadata updates to outside the lock where possible.
      • Explore applying checkpoint metadata incrementally (per-table) without holding the checkpoint lock across all tables, releasing and reacquiring between tables to give readers a chance to proceed.
      • Profile whether `apply checkpoint metadata most recent time` (200 ms in this run) can be reduced by caching or batching the metadata diff.
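
The second direction (releasing and reacquiring the lock between tables) could look roughly like the sketch below. It is illustrative only and does not use WiredTiger's locking or dhandle APIs: `checkpoint_lock` is a plain POSIX mutex standing in for the checkpoint lock, and `struct table_meta`, `apply_one_table` and `apply_checkpoint_meta_batched` are hypothetical names. The `batch` parameter is likewise an assumption, added so the yield frequency can be tuned.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for one table's entry in the picked-up checkpoint. */
struct table_meta {
    const char *uri;
    int applied; /* set once this table's metadata has been applied */
};

/* Hypothetical stand-in for the checkpoint lock; not WiredTiger's lock type. */
static pthread_mutex_t checkpoint_lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder for the per-table work (config update, marking the dhandle outdated). */
static void
apply_one_table(struct table_meta *t)
{
    t->applied = 1;
}

/*
 * Apply metadata in batches of at most `batch` tables, dropping the lock
 * between batches so that waiting readers get a chance to acquire it.
 * Returns the number of times the lock was acquired.
 */
static size_t
apply_checkpoint_meta_batched(struct table_meta *tables, size_t ntables, size_t batch)
{
    size_t acquisitions = 0, i, j, end;

    assert(batch > 0);
    for (i = 0; i < ntables; i = end) {
        /* Last batch may be short; otherwise take exactly `batch` tables. */
        end = (ntables - i < batch) ? ntables : i + batch;

        pthread_mutex_lock(&checkpoint_lock);
        ++acquisitions;
        for (j = i; j < end; ++j)
            apply_one_table(&tables[j]);
        pthread_mutex_unlock(&checkpoint_lock);
        /* Readers blocked on checkpoint_lock can acquire it here. */
    }
    return (acquisitions);
}
```

With five tables and `batch` set to 2, the lock is taken three times instead of once. Two caveats apply. First, a POSIX mutex makes no fairness guarantee, so unlocking and immediately relocking does not ensure a waiting reader runs; a real change may need an explicit yield or a fair lock. Second, the Root Cause section notes that the lock exists to make the metadata update atomic, so any batching scheme has to tolerate readers observing a partially applied checkpoint or provide atomicity some other way.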

            Assignee:
            Peter Macko
            Reporter:
            Chenhao Qu