Disaggregated storage checkpoint pick-up pinned-timestamp panic check reads unpopulated metadata (dead code)

XMLWordPrintableJSON

      Problem

      __disagg_pick_up_checkpoint is supposed to refuse (panic) adopting a checkpoint whose oldest_timestamp exceeds the connection's current pinned timestamp - protecting any actively-pinned reader from having its data pruned out from under it. The check, as written, can never fire.

      Root cause

      In src/conn/conn_layered_checkpoint_pick_up.c, __disagg_pick_up_checkpoint:

      1. WT_DISAGG_METADATA metadata is declared and WT_CLEAR(metadata)'d near the top of the function.
      2. The pinned-timestamp panic check runs immediately after, reading metadata.oldest_timestamp - which is still zero from the clear.
      3. Only afterward does _wti_disagg_fetch_shared_meta + _wt_disagg_parse_meta actually populate metadata with the real checkpoint fields, including the real oldest_timestamp.

      So the check is effectively 0 > pinned_timestamp, which is essentially always false (pinned_timestamp, when set, is >= 1; when unset the != WT_TS_NONE guard already skips the check). The safety net is dead code - a checkpoint whose real oldest_timestamp exceeds an actively-pinned reader's timestamp can be silently adopted.

      Fix

      Move the pinned-timestamp check (the _wt_txn_update_pinned_timestamp refresh + the comparison + panic) to after the _wt_disagg_parse_meta call, so it reads the real, parsed metadata.oldest_timestamp instead of the pre-parse zeroed placeholder. No other logic changes.

      Verification

      Confirmed with a scripted single-scenario repro that forces a checkpoint whose real oldest_timestamp exceeds an active reader's pinned read timestamp:

      • Before the fix: adoption silently succeeds; the stale pin later surfaces as an unrelated, confusing crash in __clayered_reopen_stable (see the companion ticket for that assertion) when the reader's cursor tries to advance.
      • After the fix: __disagg_pick_up_checkpoint correctly panics at adoption time with "Disaggregated storage checkpoint oldest_timestamp ... is greater than the current pinned timestamp ...".

      Open question for the fix's severity/design

      Once this check correctly fires, it does so via a hard __wt_panic (full process abort + restart), not a graceful refusal or a per-reader failure. A reader that simply falls behind the checkpoint cadence (e.g. a slow scan under sustained write load) will now crash the whole node rather than just failing or blocking that one reader. Worth a design discussion on whether panic is the intended response here, or whether checkpoint adoption should instead be deferred/refused non-fatally when this condition is detected.

      Relationship to the __clayered_reopen_stable assertion

      This is a separate, non-overlapping bug from the "upgrading a positioned stable cursor" assertion in __clayered_reopen_stable (filed separately). That assertion can fire even when this check is fixed and every adopted checkpoint fully respects the pinned-timestamp invariant - confirmed empirically with a repro that keeps a wide oldest lag so this panic never fires, yet the other assertion still fires reliably. The two bugs operate at different granularities: this one is a connection-wide timestamp invariant; the other is a per-cursor/per-key hazard.

            Assignee:
            Haribabu Kommi
            Reporter:
            Haribabu Kommi
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: