- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Checkpoints, DHandles
- Storage Engines - Persistence
- 46.977
Problem
On a disaggregated storage secondary, reader threads accumulate 70.4 seconds of total checkpoint lock wait time (`checkpoint lock application thread wait time`) over a 15-minute mixed-workload run. With checkpoints arriving from the primary approximately every 16 seconds, each checkpoint application stalls every reader thread that needs the checkpoint lock, which contributes directly to the 29–38 ms average command latency on the secondary.
Root Cause
When the secondary picks up a new checkpoint from the primary, `_disagg_apply_checkpoint_meta` ([src/conn/conn_layered.c](src/conn/conn_layered.c)) runs under the schema lock and the checkpoint lock. This is necessary to atomically update local metadata and mark affected dhandles outdated. However, the checkpoint lock is held for the full duration of metadata application: iterating through all tables in the checkpoint, updating each table's checkpoint config, and calling `_wti_conn_dhandle_outdated` for each.
During this window, any reader thread that needs the checkpoint lock (e.g., to verify snapshot state or read timestamps) blocks. With ~231 concurrent sessions and checkpoints every ~16 s, the cumulative wait is 70.4 s. Divided across the 30 checkpoints applied, that is ~2.3 s of aggregate reader blocking per checkpoint.
Evidence (FTDC, 2026-06-02 disagg mixed-workload run)
| Metric | Value |
|---|---|
| `checkpoint lock application thread wait time` | 70,443,950 µs (70.4 s cumulative) |
| Checkpoints applied from primary (stable_adv) | 30 over ~8 min = 1 per ~16 s |
| Average command latency (secondary) | 29,606–38,443 µs |
| Average read latency (secondary) | 685 µs |
| Concurrent connections (secondary) | ~231 |
| `apply checkpoint metadata most recent time` | 200 ms (last checkpoint) |
| `schema lock application thread wait time` | 93.5 ms cumulative |
The gap between average read latency (685 µs) and average command latency (29–38 ms) is the clearest signal: commands are being stalled by lock contention, not by I/O or computation.
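As a quick consistency check, the figures in the table can be recombined as follows. This is a back-of-the-envelope sketch, not FTDC output: the variable names are illustrative, and the last step assumes the 200 ms reading for the most recent checkpoint is typical of all 30, which the data above does not establish.

```python
# Values copied from the evidence table.
ckpt_lock_wait_us = 70_443_950   # checkpoint lock application thread wait time
checkpoints_applied = 30         # stable_adv count
window_s = 8 * 60                # "~8 min" observation window
read_latency_us = 685            # average read latency
cmd_latency_us = (29_606, 38_443)  # average command latency range
apply_meta_recent_s = 0.200      # most recent checkpoint only (ASSUMED typical)

# Average spacing between checkpoint pickups.
ckpt_interval_s = window_s / checkpoints_applied

# Aggregate reader wait attributable to each applied checkpoint.
wait_per_ckpt_s = ckpt_lock_wait_us / 1e6 / checkpoints_applied

# How many times slower a command is than a plain read.
latency_ratio = tuple(c / read_latency_us for c in cmd_latency_us)

# Rough number of readers that would each have to wait out a full 200 ms
# apply to produce the observed aggregate wait per checkpoint.
equiv_full_waiters = wait_per_ckpt_s / apply_meta_recent_s

print(f"checkpoint interval:       {ckpt_interval_s:.1f} s")
print(f"aggregate wait/checkpoint: {wait_per_ckpt_s:.2f} s")
print(f"command/read latency:      {latency_ratio[0]:.0f}x to {latency_ratio[1]:.0f}x")
print(f"equivalent full waiters:   {equiv_full_waiters:.1f}")
```

The per-checkpoint figure matches the ~2.3 s quoted under Root Cause. The "equivalent full waiters" number is only as good as the 200 ms assumption and should be confirmed against per-checkpoint apply times.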
Impact
All reader sessions on the secondary are serialized behind checkpoint application. Because checkpoint pickup is continuous (as long as the primary is writing), there is no steady-state period where readers are unaffected. The effect scales with the number of concurrent sessions and the number of tables in the checkpoint.
Suggested Direction
- Reduce the scope of work done under the checkpoint lock during pickup: where possible, defer dhandle invalidation or table metadata updates until after the lock is released.
- Explore applying checkpoint metadata incrementally (per-table) without holding the checkpoint lock across all tables, releasing and reacquiring between tables to give readers a chance to proceed.
- Profile whether `apply checkpoint metadata most recent time` (200 ms in this run) can be reduced by caching or batching the metadata diff.
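The difference between holding the lock across all tables and the per-table option in the second bullet can be illustrated with a small model. This is a minimal sketch in Python, not WiredTiger code: `InstrumentedLock`, `apply_holding_across_tables`, and `apply_per_table` are hypothetical names, and `threading.Lock` merely stands in for the checkpoint lock. It also ignores the atomicity requirement noted under Root Cause; with per-table release, a reader could observe a partially applied checkpoint unless something else prevents that.

```python
import threading
from contextlib import contextmanager


class InstrumentedLock:
    """Stand-in for the checkpoint lock; records tables processed per hold."""

    def __init__(self):
        self._lock = threading.Lock()
        self.tables_per_hold = []

    @contextmanager
    def held(self):
        # Each entry to this context is one acquisition of the lock.
        with self._lock:
            self.tables_per_hold.append(0)
            yield

    def note_table(self):
        # Must be called while the lock is held.
        self.tables_per_hold[-1] += 1


def apply_holding_across_tables(tables, lock, apply_table):
    """Pattern described under Root Cause: one hold covers every table."""
    with lock.held():
        for name in tables:
            apply_table(name)
            lock.note_table()


def apply_per_table(tables, lock, apply_table):
    """Second bullet: release and reacquire the lock between tables."""
    for name in tables:
        with lock.held():
            apply_table(name)
            lock.note_table()
        # The lock is free here, so a waiting reader has a chance to take it.


tables = ["table:a", "table:b", "table:c"]
coarse, fine = InstrumentedLock(), InstrumentedLock()
applied_coarse, applied_fine = [], []

apply_holding_across_tables(tables, coarse, applied_coarse.append)
apply_per_table(tables, fine, applied_fine.append)

print(coarse.tables_per_hold)  # [3]
print(fine.tables_per_hold)    # [1, 1, 1]
```

In this model the longest single hold drops from the whole checkpoint to one table, at the cost of more acquisitions. Whether waiting readers actually get in between tables depends on the fairness of the real lock, which this sketch does not model.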