-
Type:
Bug
-
Resolution: Fixed
-
Priority:
Major - P3
-
Affects Version/s: None
-
Component/s: Checkpoints
-
Storage Engines - Transactions
-
10.828
-
SE Transactions - 2026-07-03
-
1
-
48
Problem
__disagg_pick_up_checkpoint is supposed to refuse (panic) adopting a checkpoint whose oldest_timestamp exceeds the connection's current pinned timestamp - protecting any actively-pinned reader from having its data pruned out from under it. The check, as written, can never fire.
Root cause
In src/conn/conn_layered_checkpoint_pick_up.c, __disagg_pick_up_checkpoint:
- WT_DISAGG_METADATA metadata is declared and WT_CLEAR(metadata)'d near the top of the function.
- The pinned-timestamp panic check runs immediately after, reading metadata.oldest_timestamp - which is still zero from the clear.
- Only afterward does _wti_disagg_fetch_shared_meta + _wt_disagg_parse_meta actually populate metadata with the real checkpoint fields, including the real oldest_timestamp.
So the check is effectively 0 > pinned_timestamp, which is essentially always false (pinned_timestamp, when set, is >= 1; when unset the != WT_TS_NONE guard already skips the check). The safety net is dead code - a checkpoint whose real oldest_timestamp exceeds an actively-pinned reader's timestamp can be silently adopted.
Fix
Move the pinned-timestamp check (the _wt_txn_update_pinned_timestamp refresh + the comparison + panic) to after the _wt_disagg_parse_meta call, so it reads the real, parsed metadata.oldest_timestamp instead of the pre-parse zeroed placeholder. No other logic changes.
Verification
Confirmed with a scripted single-scenario repro that forces a checkpoint whose real oldest_timestamp exceeds an active reader's pinned read timestamp:
- Before the fix: adoption silently succeeds; the stale pin later surfaces as an unrelated, confusing crash in __clayered_reopen_stable (see the companion ticket for that assertion) when the reader's cursor tries to advance.
- After the fix: __disagg_pick_up_checkpoint correctly panics at adoption time with "Disaggregated storage checkpoint oldest_timestamp ... is greater than the current pinned timestamp ...".
Open question for the fix's severity/design
Once this check correctly fires, it does so via a hard __wt_panic (full process abort + restart), not a graceful refusal or a per-reader failure. A reader that simply falls behind the checkpoint cadence (e.g. a slow scan under sustained write load) will now crash the whole node rather than just failing or blocking that one reader. Worth a design discussion on whether panic is the intended response here, or whether checkpoint adoption should instead be deferred/refused non-fatally when this condition is detected.
Relationship to the __clayered_reopen_stable assertion
This is a separate, non-overlapping bug from the "upgrading a positioned stable cursor" assertion in __clayered_reopen_stable (filed separately). That assertion can fire even when this check is fixed and every adopted checkpoint fully respects the pinned-timestamp invariant - confirmed empirically with a repro that keeps a wide oldest lag so this panic never fires, yet the other assertion still fires reliably. The two bugs operate at different granularities: this one is a connection-wide timestamp invariant; the other is a per-cursor/per-key hazard.
- is related to
-
WT-17969 Follower layered cursor aborts in __clayered_reopen_stable when a read-timestamped reader's parked stable key vanishes from a legitimately-adopted newer checkpoint
-
- In Code Review
-