Type: Bug
Resolution: Works as Designed
Priority: Critical - P2
Fix Version/s: None
Affects Version/s: None
Component/s: Checkpoints
Labels:

Assigned Teams: Storage Engines
Total Hours with Assigned Team: 1,520.169
Sprint: SE Persistence backlog
Story Points: None

Summary

When a stable btree has in-memory INPROGRESS prepared updates whose prepare_ts > stable_ts, the precise checkpoint skip logic in rec_visibility.c omits those updates from the page's time aggregate, producing ta.prepare=0 (and all-zero timestamps). This zeroed time aggregate is then written to the local metadata via __wt_meta_ckptlist_set, enqueued to the shared metadata table by __block_disagg_checkpoint_resolve, and persisted to shared storage. When a follower subsequently picks up that checkpoint, its local metadata for the stable btree has prepare=0, so __prepared_discover_btree_has_prepare returns false and prepare_discover skips the stable btree entirely. The prepared transactions are left unresolved, and the next step-up drain eventually triggers the assertion in __layered_assert_stable_btree_state.

Steps to Reproduce

Run the CONFIG posted in the comments in Evergreen.

Evidence

Trace output from [SHARED_META_WRITE] shows the wrong value being written to the shared metadata table:

[SHARED_META_WRITE] op=UPDATE key=file:T00003.wt_stable value=...checkpoint=(WiredTigerCheckpoint.4=(addr="...",order=4,...,newest_start_durable_ts=0,oldest_start_ts=0,newest_txn=0,...,prepare=0,...))

All timestamp fields are zero even though the stable btree holds in-memory INPROGRESS prepared updates that need to be discovered by the follower.
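
To make the consequence of prepare=0 concrete, here is a minimal sketch of the kind of check a follower could apply to a checkpoint metadata value like the one in the trace. It is not WiredTiger's code: model_meta_has_prepare is a hypothetical helper, and the plain string scan is an assumption standing in for the real configuration parser. It only illustrates that a value carrying prepare=0 makes the btree look as if it holds no prepared updates.

```c
#include <assert.h>  /* used by the checks described in the usage note */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * model_meta_has_prepare --
 *     Hypothetical helper (not a WiredTiger function): return true if the
 *     metadata value contains a "prepare=" field with a non-zero value.
 */
static bool
model_meta_has_prepare(const char *value)
{
    static const char key[] = "prepare=";
    const size_t keylen = sizeof(key) - 1;
    const char *p = value;

    while ((p = strstr(p, key)) != NULL) {
        /* Only accept a match that starts a field, not the tail of a longer key. */
        bool at_boundary = (p == value) || p[-1] == ',' || p[-1] == '(';

        if (at_boundary)
            return strtoul(p + keylen, NULL, 10) != 0;
        p += keylen;
    }
    /* No prepare field at all: treat the btree as having no prepared updates. */
    return false;
}
```

With a shortened, made-up value such as "checkpoint=(WiredTigerCheckpoint.4=(order=4,newest_txn=0,prepare=0))" the helper returns false, which is the condition under which discovery would skip the btree. The same value with prepare=1 returns true.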

Root Cause

In src/reconcile/rec_visibility.c, the precise checkpoint path skips INPROGRESS updates when prepare_ts > rec_start_pinned_stable_ts:

if (upd->prepare_ts > r->rec_start_pinned_stable_ts) {
    *has_newer_updatesp = true;
    if (upd->txnid == WT_TXN_ABORTED && upd->type != WT_UPDATE_TOMBSTONE)
        upd_select->skip_aborted_prepared_value = true;
    continue;  // ta.prepare stays 0
}

Skipping the update prevents the time window from being aggregated, so ta.prepare remains 0. The downstream chain then propagates this incorrect value all the way to the shared metadata table and eventually to the follower's local metadata.
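
The effect of the skip can be reproduced in a toy model. The sketch below is an assumption-laden simplification, not the reconciliation code: model_update, model_time_aggregate and model_reconcile are hypothetical names, and the aggregate tracks only a prepare flag and one timestamp. It shows that when every prepared update has prepare_ts above the pinned stable timestamp, each one hits the continue and the aggregate is returned all-zero.

```c
#include <assert.h>  /* used by the checks described in the usage note */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are not WiredTiger's real types. */
typedef struct {
    uint64_t prepare_ts; /* prepare timestamp of an in-progress prepared update */
} model_update;

typedef struct {
    uint64_t newest_start_ts; /* newest timestamp folded into the aggregate */
    bool prepare;             /* true if any prepared update was aggregated */
} model_time_aggregate;

/*
 * model_reconcile --
 *     Toy version of the skip: updates prepared after the pinned stable
 *     timestamp are skipped, so they never reach the aggregate.
 */
static model_time_aggregate
model_reconcile(const model_update *upds, size_t n, uint64_t pinned_stable_ts,
  bool *has_newer_updatesp)
{
    model_time_aggregate ta = {0, false};

    *has_newer_updatesp = false;
    for (size_t i = 0; i < n; i++) {
        if (upds[i].prepare_ts > pinned_stable_ts) {
            *has_newer_updatesp = true;
            continue; /* skipped: ta.prepare is not set for this update */
        }
        ta.prepare = true;
        if (upds[i].prepare_ts > ta.newest_start_ts)
            ta.newest_start_ts = upds[i].prepare_ts;
    }
    return ta;
}
```

For two updates prepared at timestamps 30 and 40 with a pinned stable timestamp of 20, the model reports has_newer_updates as true but returns prepare=false and a zero timestamp, matching the zeroed aggregate described above. Raising the pinned stable timestamp to 35 lets the first update through and sets the flag.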

Impact

prepare_discover silently skips stable btrees that contain prepared transactions.
Prepared transactions are left unresolved across role switches.
On the next step-up, the drain asserts in __layered_assert_stable_btree_state when it finds an unresolved INPROGRESS update on the stable btree.

Affected Components

src/reconcile/rec_visibility.c — precise checkpoint skip logic
src/block_disagg/block_disagg_ckpt.c — __block_disagg_checkpoint_resolve
src/conn/conn_layered.c — __disagg_apply_checkpoint_meta
src/prepared_discover/prepared_discover_walk.c — __prepared_discover_btree_has_prepare

is related to

WT-17459 test/format (disagg.mode=switch) WiredTiger assertion failed: 'upd->prepare_state != (uint8_t)1 && upd->prepare_state != (uint8_t)2' (Closed)

related to

WT-17586 Drain leaks aborted prepared marker without a globally visible fallback (Closed)

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Chenhao Qu
Votes: 0
Watchers: 2

Created: May 19 2026 11:59:56 PM UTC
Updated: Jul 09 2026 08:28:45 AM UTC
Resolved: May 20 2026 01:56:51 AM UTC
