Disaggregated: precise checkpoint writes wrong time aggregate (prepare=0) to shared metadata for stable btrees with skipped INPROGRESS prepared updates

    • Storage Engines
    • 432.653
    • SE Persistence backlog
    • None

      Summary

      When a stable btree has in-memory INPROGRESS prepared updates whose prepare_ts > stable_ts, the precise checkpoint skip logic in rec_visibility.c omits those updates from the page's time aggregate, producing ta.prepare=0 (and all-zero timestamps). This zeroed time aggregate is then written to the local metadata via __wt_meta_ckptlist_set, enqueued to the shared metadata table by __block_disagg_checkpoint_resolve, and persisted to shared storage. When a follower subsequently picks up that checkpoint, its local metadata for the stable btree has prepare=0, so __prepared_discover_btree_has_prepare returns false and prepare_discover skips the stable btree entirely. The prepared transactions are left unresolved, which eventually triggers the assertion in __layered_assert_stable_btree_state on the next step-up drain.

      Steps to Reproduce

      Run the CONFIG posted in the comment in Evergreen.

      Evidence

      Trace output from [SHARED_META_WRITE] shows the wrong value being written to the shared metadata table:

      [SHARED_META_WRITE] op=UPDATE key=file:T00003.wt_stable value=...checkpoint=(WiredTigerCheckpoint.4=(addr="...",order=4,...,newest_start_durable_ts=0,oldest_start_ts=0,newest_txn=0,...,prepare=0,...))
      

      All timestamp fields are zero even though the stable btree holds in-memory INPROGRESS prepared updates that need to be discovered by the follower.
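
      When scanning many such trace lines, the prepare field can be pulled out mechanically. The helper below is a minimal sketch, not part of WiredTiger: the name trace_find_prepare is hypothetical, and it assumes the value string uses the comma-separated key=value layout shown in the trace above.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * trace_find_prepare --
 *     Hypothetical helper: find the first "prepare=" field in a metadata value string and store
 *     its numeric value. Returns false if no such field is present or it has no digits.
 */
static bool
trace_find_prepare(const char *value, long *preparep)
{
    static const char key[] = "prepare=";
    const size_t keylen = sizeof(key) - 1;
    const char *p = value;
    char *end;

    while ((p = strstr(p, key)) != NULL) {
        /* Accept the match only when it is a whole field name, not a suffix of another key. */
        if (p == value || p[-1] == ',' || p[-1] == '(') {
            p += keylen;
            *preparep = strtol(p, &end, 10);
            return (end != p);
        }
        p += keylen;
    }
    return (false);
}
```

      For the trace line above this would report prepare as 0, which is the symptom to look for on a stable btree that is known to hold prepared updates.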

      Root Cause

      In src/reconcile/rec_visibility.c, the precise checkpoint path skips INPROGRESS updates when prepare_ts > rec_start_pinned_stable_ts:

      if (upd->prepare_ts > r->rec_start_pinned_stable_ts) {
          *has_newer_updatesp = true;
          if (upd->txnid == WT_TXN_ABORTED && upd->type != WT_UPDATE_TOMBSTONE)
              upd_select->skip_aborted_prepared_value = true;
          continue;  // ta.prepare stays 0
      }
      

      Skipping the update prevents the time window from being aggregated, so ta.prepare remains 0. The downstream chain then propagates this incorrect value all the way to the shared metadata table and eventually to the follower's local metadata.
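
      The effect of the skip can be illustrated with a stripped-down model. This is a sketch under stated assumptions, not the reconciliation code: every model_* type and function is hypothetical, only the prepare flag of the time aggregate is modeled, and it assumes that a prepared update which is not skipped sets the flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an in-memory update. */
typedef struct {
    uint64_t prepare_ts; /* prepare timestamp of the update */
    bool inprogress;     /* true while the prepared transaction is unresolved */
} model_update;

/* Hypothetical, simplified time aggregate: only the prepare flag is modeled. */
typedef struct {
    uint8_t prepare;
} model_time_aggregate;

/*
 * model_aggregate_page --
 *     Aggregate a page's updates, skipping INPROGRESS updates prepared after the pinned stable
 *     timestamp, as the precise checkpoint path does in the excerpt above.
 */
static void
model_aggregate_page(const model_update *upds, size_t n, uint64_t pinned_stable_ts,
  model_time_aggregate *ta, bool *has_newer_updatesp)
{
    ta->prepare = 0;
    *has_newer_updatesp = false;

    for (size_t i = 0; i < n; i++) {
        if (upds[i].inprogress && upds[i].prepare_ts > pinned_stable_ts) {
            *has_newer_updatesp = true;
            continue; /* Skipped: the update contributes nothing to the aggregate. */
        }
        if (upds[i].inprogress)
            ta->prepare = 1; /* Model assumption: a selected prepared update sets the flag. */
    }
}
```

      In this model, a page whose only update has prepare_ts=20 aggregates to prepare=0 when the pinned stable timestamp is 10, even though has_newer_updates is set. The information that a prepared update exists survives only in memory and never reaches the metadata.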

      Impact

      • prepare_discover silently skips stable btrees that contain prepared transactions.
      • Prepared transactions are left unresolved across role switches.
      • On the next step-up, the drain asserts in __layered_assert_stable_btree_state when it finds an unresolved INPROGRESS update on the stable btree.
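
      The follower-side consequences listed above can be modeled the same way. This is a hedged sketch only: the model_* names are hypothetical, and the real discovery and drain code paths are considerably more involved. It assumes discovery is gated purely on the prepare flag read from the checkpoint metadata.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a stable btree on the follower. */
typedef struct {
    const char *uri;
    int meta_prepare;        /* prepare flag read from the checkpoint metadata */
    int unresolved_prepared; /* prepared updates still awaiting resolution */
} model_btree;

/* Mirrors the metadata-only check: a zero flag means "nothing to discover". */
static bool
model_btree_has_prepare(const model_btree *btree)
{
    return (btree->meta_prepare != 0);
}

/* Visit only btrees whose metadata reports prepared content; return how many were visited. */
static size_t
model_prepare_discover(model_btree *btrees, size_t n)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        if (!model_btree_has_prepare(&btrees[i]))
            continue; /* Silently skipped when the metadata says prepare=0. */
        btrees[i].unresolved_prepared = 0; /* Model: discovery resolves everything it visits. */
        visited++;
    }
    return (visited);
}

/* The condition the step-up drain would assert in this model. */
static bool
model_stable_state_ok(const model_btree *btree)
{
    return (btree->unresolved_prepared == 0);
}
```

      With a btree whose metadata says prepare=0 but which still holds one unresolved prepared update, model_prepare_discover never visits it and model_stable_state_ok later returns false, matching the failure sequence described in this ticket.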

      Affected Components

      • src/reconcile/rec_visibility.c: precise checkpoint skip logic
      • src/block_disagg/block_disagg_ckpt.c: __block_disagg_checkpoint_resolve
      • src/conn/conn_layered.c: __disagg_apply_checkpoint_meta
      • src/prepared_discover/prepared_discover_walk.c: __prepared_discover_btree_has_prepare

      Assignee: [DO NOT USE] Backlog - Storage Engines Team
      Reporter: Chenhao Qu