Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Checkpoints, DHandles
Labels:
None

Assigned Teams:

Storage Engines - Persistence
Total Hours with Assigned Team:
1,137.737
Epic Link:
Improve dhandle scaling in disagg (Post PuP)
Sprint:
SE Persistence backlog
Story Points:
None

Problem

On a disaggregated storage secondary, reader threads accumulate 70.4 seconds of total checkpoint lock wait time (`checkpoint lock application thread wait time`) over a 15-minute mixed-workload run. With checkpoints arriving from the primary approximately every 16 seconds, each checkpoint application blocks all reader threads for a meaningful interval, directly contributing to the 29–38 ms average command latency on the secondary.

Root Cause

When the secondary picks up a new checkpoint from the primary, `_disagg_apply_checkpoint_meta` ([src/conn/conn_layered.c](src/conn/conn_layered.c)) runs under the schema lock and the checkpoint lock. This is necessary to atomically update local metadata and mark affected dhandles outdated. However, the checkpoint lock is held for the full duration of metadata application — iterating through all tables in the checkpoint, updating each table's checkpoint config, and calling `_wti_conn_dhandle_outdated` for each.

During this window, any reader thread that needs the checkpoint lock (e.g., to verify snapshot state or read timestamps) blocks. With 231 concurrent sessions and checkpoints every ~16 s, the cumulative wait is 70.4 s. Across the 30 checkpoints applied, that averages ~2.3 s of aggregate reader blocking per checkpoint.

Evidence (FTDC, 2026-06-02 disagg mixed-workload run)

| Metric | Value |
| --- | --- |
| `checkpoint lock application thread wait time` | 70,443,950 µs (70.4 s cumulative) |
| Checkpoints applied from primary (stable_adv) | 30 over ~8 min = 1 per ~16 s |
| Average command latency (secondary) | 29,606–38,443 µs |
| Average read latency (secondary) | 685 µs |
| Concurrent connections (secondary) | ~231 |
| `apply checkpoint metadata most recent time` | 200 ms (last checkpoint) |
| `schema lock application thread wait time` | 93.5 ms cumulative |

The gap between average read latency (685 µs) and average command latency (29–38 ms) is the clearest signal: commands are being stalled by lock contention, not by I/O or computation.
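The per-checkpoint figures quoted in this ticket follow directly from the table values. A minimal sketch of that arithmetic is below; the constant names are made up for illustration, and the per-session number assumes the wait is spread evenly across all ~231 sessions, which the FTDC data does not establish.

```python
# Figures copied from the evidence table; names are illustrative only.
CKPT_LOCK_WAIT_US = 70_443_950  # checkpoint lock application thread wait time
CHECKPOINTS_APPLIED = 30        # checkpoints picked up from the primary
WINDOW_S = 8 * 60               # ~8 min window in which they were applied
SESSIONS = 231                  # concurrent connections on the secondary

# Average spacing between checkpoint pickups.
interval_s = WINDOW_S / CHECKPOINTS_APPLIED

# Aggregate reader wait attributable to each applied checkpoint.
wait_per_checkpoint_s = CKPT_LOCK_WAIT_US / CHECKPOINTS_APPLIED / 1e6

# ASSUMPTION: wait is shared evenly by every session (not shown by the data).
wait_per_session_ms = wait_per_checkpoint_s * 1000 / SESSIONS

print(f"checkpoint interval:        {interval_s:.1f} s")
print(f"aggregate wait/checkpoint:  {wait_per_checkpoint_s:.2f} s")
print(f"even-split wait/session:    {wait_per_session_ms:.1f} ms")
```

Under the even-split assumption, each session would lose on the order of 10 ms per checkpoint. The real distribution is probably skewed toward the sessions that happen to need the lock during the apply window.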

Impact

All reader sessions on the secondary are serialized behind checkpoint application. Because checkpoint pickup is continuous (as long as the primary is writing), there is no steady-state period where readers are unaffected. The effect scales with the number of concurrent sessions and the number of tables in the checkpoint.

Suggested Direction

- Reduce the scope of work done under the checkpoint lock during pickup: defer dhandle invalidation or table metadata updates to outside the lock where possible.
- Explore applying checkpoint metadata incrementally (per-table) without holding the checkpoint lock across all tables, releasing and reacquiring between tables to give readers a chance to proceed.
- Profile whether `apply checkpoint metadata most recent time` (200 ms in this run) can be reduced by caching or batching the metadata diff.
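The per-table idea can be illustrated with a toy model. This is a sketch only, not WiredTiger code: `CountingLock`, `apply_coarse`, `apply_per_table`, and the table names are all hypothetical, and a Python `threading.Lock` stands in for the checkpoint lock. It shows only how the lock scope differs between the two shapes.

```python
import threading


class CountingLock:
    """threading.Lock wrapper that counts acquisitions (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def apply_coarse(lock, tables, apply_table):
    """Current shape: a single lock hold spans every table in the checkpoint."""
    with lock:
        for table in tables:
            apply_table(table)


def apply_per_table(lock, tables, apply_table):
    """Suggested shape: release and reacquire the lock between tables."""
    for table in tables:
        with lock:
            apply_table(table)
        # The lock is free here, so a waiting reader has a chance to run.


tables = ["table_a", "table_b", "table_c"]  # hypothetical table names
applied = []

coarse_lock = CountingLock()
apply_coarse(coarse_lock, tables, applied.append)

fine_lock = CountingLock()
apply_per_table(fine_lock, tables, applied.append)

print(coarse_lock.acquisitions, fine_lock.acquisitions)
```

Two caveats apply to the per-table shape. It gives up the cross-table atomicity that the single hold provides today, so a reader admitted between tables could observe a mix of old and new checkpoint metadata unless something else guards against that. Releasing a lock also does not guarantee that a waiter acquires it before the applying thread takes it again, so the actual benefit would need to be measured.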

Assignee: Peter Macko
Reporter: Chenhao Qu
Votes: 0
Watchers: 2

Created: Jun 04 2026 11:11:44 PM UTC
Updated: Jul 20 2026 10:32:42 PM UTC
