Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Not Applicable
Labels: dc
Assigned Teams: Storage Engines - Foundations
Total Hours with Assigned Team: 1,630.973
Sprint: SE Persistence backlog
Story Points: None

Symptom

While trying to reproduce ~~WT-17304~~ with its config (format-stress-test-disagg-switch — disagg.mode=switch, ops.prepare=1, precise_checkpoint=1, preserve_prepared=1, runs.source=layered, disagg.multi=0), an eviction worker thread aborts on the following assertion:

src/reconcile/rec_visibility.c:497
    WT_ASSERT(session, select_tw->stop_txn >= select_tw->start_txn);

Stack:

#3 __timestamp_no_ts_fix         rec_visibility.c:497
#4 __wti_rec_upd_select          rec_visibility.c:1657
#5 __wti_rec_row_leaf            rec_row.c:1138
#6 __reconcile                   rec_write.c:316
#7 __wt_reconcile                rec_write.c:127
#8 __evict_reconcile             evict_page.c:1277
#9 __wt_evict                    evict_page.c:443
#10 __wti_evict_page             evict_dispatch.c:254
#11 __wti_evict_lru_pages        evict_queue.c:140
#12 __evict_thread_run           evict_thread.c:117
#13 __thread_run                 thread_group.c:32

The faulting page is on a layered table's stable constituent: session->dhandle->name = "file:T00001.wt_stable".

State at the assertion

The selected time window for the row about to be reconciled has its txn ids inverted while its timestamps are correctly ordered:

(gdb) p *select_tw
$3 = {
  durable_start_ts = 3187849,
  start_ts        = 3187849,
  start_prepare_ts = 0,
  start_txn       = 1599698,
  start_prepared_id = 0,
  durable_stop_ts = 3188959,
  stop_ts         = 3188959,
  stop_prepare_ts = 0,
  stop_txn        = 1599460,
  stop_prepared_id = 0
}

stop_ts (3188959) > start_ts (3187849) → timestamps consistent.
stop_txn (1599460) < start_txn (1599698) → assertion fails.
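
The two comparisons above can be sketched as a small standalone check. This is only an illustration built from the gdb dump: the struct `tw_sketch` and the helper names are hypothetical stand-ins, not WiredTiger's own time window type or API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the fields printed from select_tw. */
struct tw_sketch {
    uint64_t start_ts, stop_ts;   /* commit timestamps */
    uint64_t start_txn, stop_txn; /* transaction ids */
};

/* Values copied from the gdb dump in this ticket. */
static struct tw_sketch
tw_from_dump(void)
{
    struct tw_sketch tw = {3187849, 3188959, 1599698, 1599460};
    return tw;
}

/* Timestamp ordering: holds for the dumped window. */
static bool
tw_timestamps_ordered(const struct tw_sketch *tw)
{
    return tw->stop_ts >= tw->start_ts;
}

/* Txn-id ordering: the condition the failing WT_ASSERT expresses. */
static bool
tw_txn_ids_ordered(const struct tw_sketch *tw)
{
    return tw->stop_txn >= tw->start_txn;
}
```

With the dumped values, the timestamp check passes and the txn-id check fails, which matches the abort at rec_visibility.c:497.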

Update chain that produced it

For the same key, the in-memory update chain visited by __wti_rec_upd_select is:

WT_ROW_UPDATE(page, rip) = 0x138e1a85ddc0   /* head: tombstone */
  txnid        = 1599460
  durable_ts   = 3188959, start_ts = 3188959
  prepared_id  = 0, prepare_ts = 0
  prepare_state = 0
  type         = 4   /* WT_UPDATE_TOMBSTONE */
  flags        = 0x800   /* WT_UPDATE_RESTORED_FROM_INGEST */
  next         -> 0x138e08aac7d0

next:
  txnid        = 1599698
  durable_ts   = 3187849, start_ts = 3187849
  prepared_id  = 0, prepare_ts = 0
  prepare_state = 0
  type         = 3   /* WT_UPDATE_STANDARD */
  flags        = 0x204   /* WT_UPDATE_DS | WT_UPDATE_RESTORED_FROM_DS */

Notes:

The head update is a tombstone flagged WT_UPDATE_RESTORED_FROM_INGEST — it was created during the ingest → stable drain on step-up, carrying the ingest btree's transaction id verbatim.
The underlying value is flagged WT_UPDATE_DS | WT_UPDATE_RESTORED_FROM_DS — it was instantiated from the stable's on-disk cell when the page was read in.
Neither update is prepared; both have prepare_state=0, prepared_id=0, prepare_ts=0.
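
A rough model of how this chain could yield the inverted window is sketched below. It assumes, as a simplification, that update selection takes the stop point from a head tombstone and the start point from the next update on the chain. The types and the function `derive_window_sketch` are hypothetical and much smaller than what `__wti_rec_upd_select` actually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Type codes as shown in the dump: 3 = standard value, 4 = tombstone. */
enum { UPD_STANDARD_SKETCH = 3, UPD_TOMBSTONE_SKETCH = 4 };

/* Hypothetical, reduced update: only the fields the sketch needs. */
struct upd_sketch {
    uint64_t txnid;
    uint64_t start_ts;
    int type;
    const struct upd_sketch *next;
};

struct window_sketch {
    uint64_t start_ts, start_txn;
    uint64_t stop_ts, stop_txn;
};

/*
 * Assumed simplification: a head tombstone supplies the stop point and
 * the update after it supplies the start point. With no tombstone the
 * stop point stays at its "no stop" maximum.
 */
static struct window_sketch
derive_window_sketch(const struct upd_sketch *head)
{
    struct window_sketch tw = {0, 0, UINT64_MAX, UINT64_MAX};
    const struct upd_sketch *value = head;

    if (head != NULL && head->type == UPD_TOMBSTONE_SKETCH) {
        tw.stop_ts = head->start_ts;
        tw.stop_txn = head->txnid;
        value = head->next;
    }
    if (value != NULL) {
        tw.start_ts = value->start_ts;
        tw.start_txn = value->txnid;
    }
    return tw;
}
```

Feeding in the two updates from the dump (tombstone txnid 1599460 at ts 3188959, then value txnid 1599698 at ts 3187849) gives a window whose timestamps are ordered but whose txn ids are not, the same shape as `select_tw`.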

Additional observations

S2BT(session)->base_write_gen = 19 for this btree, whereas the page header reports a much larger write_gen: ((WT_PAGE_HEADER *)page->dsk)->write_gen = 667.
Because dsk_write_gen > base_write_gen, the existing on-disk-cell txn-id cleanup (_cell_unpack_window_need_cleanup → _cell_kv_window_cleanup) is skipped for this page; the RESTORED_FROM_DS update therefore retains whatever start_txn was stored on disk.
The tombstone's txnid=1599460 appears to be a real current-run ingest txn id, not a leaked one — i.e. the moved-from-ingest update legitimately preserves its txn id.

So the two halves of the chain were produced by different code paths that did not coordinate on the txn-id namespace, leading to the stop_txn < start_txn state that the assertion forbids.
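
The write-generation observation can be illustrated with a minimal sketch. It encodes only the condition stated above (cleanup is skipped when the page's write_gen exceeds the btree's base_write_gen) and assumes that cleanup, when it runs, resets the start txn id to 0. The function names and `TXN_NONE_SKETCH` are hypothetical, not the WiredTiger helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a cleaned-up txn id is reset to 0. */
#define TXN_NONE_SKETCH UINT64_C(0)

/* Cleanup runs only for pages written before the current base generation. */
static bool
cleanup_runs_sketch(uint64_t dsk_write_gen, uint64_t base_write_gen)
{
    return dsk_write_gen <= base_write_gen;
}

/* The start txn id an update restored from disk would carry after the read. */
static uint64_t
start_txn_after_read_sketch(
  uint64_t on_disk_start_txn, uint64_t dsk_write_gen, uint64_t base_write_gen)
{
    if (cleanup_runs_sketch(dsk_write_gen, base_write_gen))
        return TXN_NONE_SKETCH;
    return on_disk_start_txn; /* cleanup skipped: on-disk id is kept */
}
```

For the values in this ticket (dsk_write_gen = 667, base_write_gen = 19) the sketch keeps the on-disk start_txn of 1599698, which is then compared against the tombstone's 1599460.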

Reproduction

Reproduces against the ~~WT-17304~~ task configuration on aarch64.
Reproducer: format-stress-test-disagg-switch with disagg.mode=switch, ops.prepare=1, precise_checkpoint=1, preserve_prepared=1, runs.source=layered, disagg.multi=0.

Open questions / what's not yet established

Whether the root cause is:
- (a) the ingest-drain code path putting a tombstone with a non-comparable txn id onto a stable chain;
- (b) the RESTORED_FROM_DS value retaining a stale on-disk start_txn that was not cleared because the page header reports a current-run write_gen while its base image is from an earlier epoch;
- (c) the assertion itself being too strict for the layered-table case; or
- some combination.
Whether this is the same root cause as ~~WT-17304~~ (the discover-walk-skip / lost prepare-bit story) or a sibling bug.

The base_write_gen=19 vs dsk_write_gen=667 mismatch is suggestive but has not been proven to be the cause, which is why it is listed under "Additional observations" rather than as a root cause.

Related

~~WT-17304~~ (format-stress-test-disagg-switch timed out with prepare-conflict): this failure surfaced while reproducing it.

duplicates

WT-17603 [Disagg] format-stress switch+fast-truncate: assertion 'stop_txn >= start_txn' in __timestamp_no_ts_fix (eviction)

Open

is related to

WT-17304 format-stress-test-disagg-switch timed out with prepare-conflict

Closed

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Chenhao Qu
Votes: 0
Watchers: 2

Created: May 25 2026 04:19:24 AM UTC
Updated: May 26 2026 12:38:20 AM UTC
Resolved: May 26 2026 12:38:20 AM UTC

Eviction reconciliation asserts stop_txn >= start_txn on disagg stable btree during format-stress-test-disagg-switch
