- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Not Applicable
- Storage Engines - Foundations
- 118.965
- SE Persistence backlog
- None
Symptom
While trying to reproduce WT-17304 with its config (format-stress-test-disagg-switch — disagg.mode=switch, ops.prepare=1, precise_checkpoint=1, preserve_prepared=1, runs.source=layered, disagg.multi=0), an eviction worker thread aborts on the following assertion:
src/reconcile/rec_visibility.c:497
WT_ASSERT(session, select_tw->stop_txn >= select_tw->start_txn);
Stack:
#3 __timestamp_no_ts_fix rec_visibility.c:497
#4 __wti_rec_upd_select rec_visibility.c:1657
#5 __wti_rec_row_leaf rec_row.c:1138
#6 __reconcile rec_write.c:316
#7 __wt_reconcile rec_write.c:127
#8 __evict_reconcile evict_page.c:1277
#9 __wt_evict evict_page.c:443
#10 __wti_evict_page evict_dispatch.c:254
#11 __wti_evict_lru_pages evict_queue.c:140
#12 __evict_thread_run evict_thread.c:117
#13 __thread_run thread_group.c:32
The faulting page is on a layered table's stable constituent: session->dhandle->name = "file:T00001.wt_stable".
State at the assertion
The selected time window for the row about to be reconciled has its txn ids inverted while its timestamps are correctly ordered:
(gdb) p *select_tw
$3 = {
durable_start_ts = 3187849,
start_ts = 3187849,
start_prepare_ts = 0,
start_txn = 1599698,
start_prepared_id = 0,
durable_stop_ts = 3188959,
stop_ts = 3188959,
stop_prepare_ts = 0,
stop_txn = 1599460,
stop_prepared_id = 0
}
- stop_ts (3188959) > start_ts (3187849) → timestamps consistent.
- stop_txn (1599460) < start_txn (1599698) → assertion fails.
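The two checks above can be restated as a small C sketch. The struct `tw_sketch` and the helper names are hypothetical and carry only the four fields that matter here; they are not WiredTiger's own types. The values are copied from the gdb dump.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the time window printed by gdb. */
typedef struct {
    uint64_t start_ts;
    uint64_t start_txn;
    uint64_t stop_ts;
    uint64_t stop_txn;
} tw_sketch;

/* Timestamps are ordered when the stop is not earlier than the start. */
static bool
tw_ts_ordered(const tw_sketch *tw)
{
    return (tw->stop_ts >= tw->start_ts);
}

/* The same condition the failing WT_ASSERT expresses, applied to the sketch. */
static bool
tw_txn_ordered(const tw_sketch *tw)
{
    return (tw->stop_txn >= tw->start_txn);
}

/* Values copied from the gdb dump of *select_tw. */
static tw_sketch
tw_observed(void)
{
    tw_sketch tw = {3187849, 1599698, 3188959, 1599460};
    return (tw);
}
```

With the observed values, `tw_ts_ordered` holds and `tw_txn_ordered` does not, which matches the state reported at the assertion.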
Update chain that produced it
For the same key, the in-memory update chain visited by __wti_rec_upd_select is:
WT_ROW_UPDATE(page, rip) = 0x138e1a85ddc0 /* head: tombstone */
  txnid = 1599460
  durable_ts = 3188959, start_ts = 3188959
  prepared_id = 0, prepare_ts = 0
  prepare_state = 0
  type = 4 /* WT_UPDATE_TOMBSTONE */
  flags = 0x800 /* WT_UPDATE_RESTORED_FROM_INGEST */
  next -> 0x138e08aac7d0
next:
  txnid = 1599698
  durable_ts = 3187849, start_ts = 3187849
  prepared_id = 0, prepare_ts = 0
  prepare_state = 0
  type = 3 /* WT_UPDATE_STANDARD */
  flags = 0x204 /* WT_UPDATE_DS | WT_UPDATE_RESTORED_FROM_DS */
Notes:
- The head update is a tombstone flagged WT_UPDATE_RESTORED_FROM_INGEST — it was created during the ingest → stable drain on step-up, carrying the ingest btree's transaction id verbatim.
- The underlying value is flagged WT_UPDATE_DS | WT_UPDATE_RESTORED_FROM_DS — it was instantiated from the stable's on-disk cell when the page was read in.
- Neither update is prepared; both have prepare_state=0, prepared_id=0, prepare_ts=0.
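To illustrate how a chain of this shape leads to the inverted window, here is a deliberately simplified model in C. It assumes the stop side of the window is taken from a head tombstone and the start side from the value beneath it. All type and function names are hypothetical, and the real selection logic in __wti_rec_upd_select does considerably more (visibility checks, prepared-state handling, and so on).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical update types; not WiredTiger's WT_UPDATE_* values. */
enum upd_type_sketch { UPD_STANDARD_SKETCH, UPD_TOMBSTONE_SKETCH };

/* Hypothetical, minimal update: newest first, linked to older updates. */
typedef struct upd_sketch {
    uint64_t txnid;
    uint64_t start_ts;
    enum upd_type_sketch type;
    const struct upd_sketch *next; /* Next older update, or NULL. */
} upd_sketch;

/* Hypothetical result: the start/stop pair derived from the chain. */
typedef struct {
    uint64_t start_ts, start_txn;
    uint64_t stop_ts, stop_txn;
    int has_stop; /* Non-zero when a head tombstone supplied the stop side. */
} window_sketch;

/*
 * Assumed rule: a tombstone at the head supplies the stop side, and the
 * first update after it supplies the start side. Neither txn id is adjusted.
 */
static window_sketch
window_from_chain(const upd_sketch *head)
{
    window_sketch w = {0, 0, UINT64_MAX, UINT64_MAX, 0};
    const upd_sketch *upd = head;

    if (upd != NULL && upd->type == UPD_TOMBSTONE_SKETCH) {
        w.stop_ts = upd->start_ts;
        w.stop_txn = upd->txnid;
        w.has_stop = 1;
        upd = upd->next;
    }
    if (upd != NULL) {
        w.start_ts = upd->start_ts;
        w.start_txn = upd->txnid;
    }
    return (w);
}
```

Feeding in the two updates from the dump (tombstone with txnid 1599460 at timestamp 3188959 on top of a value with txnid 1599698 at timestamp 3187849) gives a window whose timestamps are ordered but whose stop_txn is smaller than its start_txn, because each txn id is carried over verbatim from its own source.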
Additional observations
- S2BT(session)->base_write_gen = 19 for this btree, whereas the page header reports a much larger write_gen: ((WT_PAGE_HEADER *)page->dsk)->write_gen = 667.
- Because dsk_write_gen > base_write_gen, the existing on-disk-cell txn-id cleanup (__cell_unpack_window_need_cleanup → __cell_kv_window_cleanup) is skipped for this page; the RESTORED_FROM_DS update therefore retains whatever start_txn was stored on disk.
- The tombstone's txnid=1599460 appears to be a real current-run ingest txn id, not a leaked one — i.e. the moved-from-ingest update legitimately preserves its txn id.
So the two halves of the chain were produced by different code paths that did not coordinate on the txn-id namespace, leading to the stop_txn < start_txn state that the assertion forbids.
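The write-generation observation can be modelled with the short C sketch below. It encodes only what is stated above: cleanup is skipped when dsk_write_gen > base_write_gen. Treating every other case as "cleanup applies", and using 0 as the cleared txn id, are assumptions made for illustration. The function names are hypothetical and this is not the WiredTiger implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: a cleared txn id is represented as 0. */
#define TXN_CLEARED_SKETCH ((uint64_t)0)

/*
 * Hypothetical predicate: cleanup is skipped when the page's on-disk write
 * generation is newer than the btree's base write generation. The inverse
 * (cleanup applies otherwise) is an assumption of this sketch.
 */
static bool
window_cleanup_applies(uint64_t dsk_write_gen, uint64_t base_write_gen)
{
    return (dsk_write_gen <= base_write_gen);
}

/* The start_txn an update restored from the on-disk cell would end up with. */
static uint64_t
restored_start_txn(uint64_t on_disk_start_txn, uint64_t dsk_write_gen, uint64_t base_write_gen)
{
    if (window_cleanup_applies(dsk_write_gen, base_write_gen))
        return (TXN_CLEARED_SKETCH);
    return (on_disk_start_txn);
}
```

With the observed values (dsk_write_gen = 667, base_write_gen = 19), the sketch keeps the on-disk start_txn of 1599698, which is the value seen on the RESTORED_FROM_DS update. Whether that retained id is the actual cause remains an open question.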
Reproduction
- Reproduces against the WT-17304 task configuration on aarch64.
- Reproducer: format-stress-test-disagg-switch with disagg.mode=switch, ops.prepare=1, precise_checkpoint=1, preserve_prepared=1, runs.source=layered, disagg.multi=0.
Open questions / what's not yet established
- Whether the root cause is:
- (a) the ingest-drain code path putting a tombstone with a non-comparable txn id onto a stable chain;
- (b) the RESTORED_FROM_DS value retaining a stale on-disk start_txn that was not cleared because the page header reports a current-run write_gen while its base image is from an earlier epoch;
- (c) the assertion itself being too strict for the layered-table case; or
- some combination.
- Whether this is the same root cause as WT-17304 (the discover-walk-skip / lost prepare-bit story) or a sibling bug.
The base_write_gen=19 vs dsk_write_gen=667 mismatch is suggestive but has not been proven to be the cause, which is why it is listed under "Additional observations" rather than as a root cause.
Related
WT-17304: format-stress-test-disagg-switch timed out with prepare-conflict. This issue surfaced while reproducing it.
- duplicates: WT-17603 [Disagg] format-stress switch+fast-truncate: assertion 'stop_txn >= start_txn' in __timestamp_no_ts_fix (eviction) (Open)
- is related to: WT-17304 format-stress-test-disagg-switch timed out with prepare-conflict (Closed)