Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: Transactions
Storage Engines - Transactions
233.549
SE Transactions - 2026-06-05
Summary
TSAN reports a data race between __wt_delete_page_rollback (rolling back a fast-delete transaction) and a concurrent cursor reader traversing the same instantiated tombstone WT_UPDATE structs. Two overlapping races on the same 64-byte heap block.
Reproduction
./t -c ../../../test/format/CONFIG.stress -T bulk,txn,retain=50 runs.rows=100000:300000 runs.tables=1:3 runs.ops=300000
Nondeterministic. Requires TSAN build of test/format.
Relevant WT_UPDATE memory layout
offset 0 : txnid, volatile uint64_t <- Race 2
offset 8 : union u (16 bytes)
    commit.durable_ts (upd_durable_ts)
    commit.start_ts (upd_start_ts) <- Race 1 (offset 16)
    prepare_rollback.rollback_ts (upd_rollback_ts)
    prepare_rollback.saved_txnid (upd_saved_txnid) <- Race 1 (same bytes)
upd_start_ts and upd_saved_txnid are the same 8 bytes at offset 16 — overlapping union members (btmem.h:1494-1513).
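The overlap can be checked with a small stand-alone sketch. The struct below is a hypothetical stand-in that mirrors only the offsets listed above; it is not the real WT_UPDATE definition from btmem.h, and every name prefixed with sketch_ is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first 24 bytes of WT_UPDATE (assumed layout). */
struct sketch_update {
    volatile uint64_t txnid; /* offset 0 */
    union {
        struct {
            uint64_t durable_ts; /* first 8 bytes of the union */
            uint64_t start_ts;   /* second 8 bytes of the union */
        } commit;
        struct {
            uint64_t rollback_ts; /* first 8 bytes of the union */
            uint64_t saved_txnid; /* second 8 bytes of the union */
        } prepare_rollback;
    } u; /* offset 8, 16 bytes */
};

/* Compile-time checks that the sketch matches the offsets in the ticket. */
_Static_assert(offsetof(struct sketch_update, u) == 8, "union at offset 8");
_Static_assert(sizeof(((struct sketch_update *)0)->u) == 16, "union is 16 bytes");
_Static_assert(offsetof(struct sketch_update, u.commit.start_ts) == 16, "start_ts at 16");
_Static_assert(
  offsetof(struct sketch_update, u.prepare_rollback.saved_txnid) == 16, "saved_txnid at 16");

/*
 * Store a txnid through saved_txnid, then read the same bytes back as start_ts.
 * Returns whatever a reader of start_ts would see after that store.
 */
static uint64_t
sketch_start_ts_after_saved_txnid_store(uint64_t start_ts, uint64_t txnid)
{
    struct sketch_update upd;

    assert(start_ts != txnid); /* otherwise the overwrite would not be observable */

    upd.txnid = txnid;
    upd.u.commit.durable_ts = start_ts;
    upd.u.commit.start_ts = start_ts;

    /* Same shape as the store at bt_delete.c:302, minus the atomic wrapper. */
    upd.u.prepare_rollback.saved_txnid = upd.txnid;

    return (upd.u.commit.start_ts);
}
```

With a start timestamp of 1000 and a txnid of 42, the function returns 42: the timestamp is gone and the reader sees the transaction ID in its place.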
Race 1 — offset 16 (bt_delete.c:302 vs timestamp_inline.h:108)
Writer T102 (__wt_delete_page_rollback, rolling back):
// bt_delete.c:302 — unconditional, no prepare guard
__wt_atomic_store_uint64_relaxed(&(*updp)->upd_saved_txnid, (*updp)->txnid);
Reader T100 (WT_TIME_WINDOW_SET_STOP, non-prepare path):
// timestamp_inline.h:108
(tw)->stop_ts = (upd)->upd_start_ts; // same 8 bytes as upd_saved_txnid
T102 stores a transaction ID into the memory slot that T100 reads as a timestamp. If T102 writes first, T100 records a txnid value as a stop timestamp — semantically corrupt.
Race 2 — offset 0 (bt_delete.c:303 vs timestamp_inline.h:99)
Writer T102:
// bt_delete.c:303
__wt_atomic_store_uint64_v_relaxed(&(*updp)->txnid, WT_TXN_ABORTED);
Reader T100:
// timestamp_inline.h:99
(tw)->stop_txn = (upd)->txnid;
T100 passed the visibility check (read txnid, found it in-snapshot), then called __wt_upd_value_assign. Between that check and this second read, T102 wrote WT_TXN_ABORTED. T100 copies WT_TXN_ABORTED into stop_txn.
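The interleaving can be replayed deterministically in a single thread, which makes the double read of txnid easier to see than in the TSAN report. This is only a sketch: the visibility rule, the struct shapes, and the value used for SKETCH_TXN_ABORTED are assumptions for illustration, not WiredTiger's real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for WT_TXN_ABORTED; the actual value is an assumption here. */
#define SKETCH_TXN_ABORTED UINT64_MAX

struct sketch_upd {
    volatile uint64_t txnid;
};

struct sketch_tw {
    uint64_t stop_txn;
};

/* Hypothetical visibility rule: not aborted and at or below the snapshot maximum. */
static bool
sketch_visible(uint64_t txnid, uint64_t snap_max)
{
    return (txnid != SKETCH_TXN_ABORTED && txnid <= snap_max);
}

/*
 * Replay the order of events from Race 2 in one thread:
 * reader check, writer rollback, reader second read.
 * Returns the stop_txn the reader ends up recording.
 */
static uint64_t
sketch_replay_race2(uint64_t txnid, uint64_t snap_max)
{
    struct sketch_upd upd = {txnid};
    struct sketch_tw tw = {0};
    bool visible;

    /* T100: first read of txnid, for the visibility check. */
    visible = sketch_visible(upd.txnid, snap_max);
    assert(visible);
    (void)visible;

    /* T102: the rollback store, as at bt_delete.c:303. */
    upd.txnid = SKETCH_TXN_ABORTED;

    /* T100: second read of txnid, as at timestamp_inline.h:99. */
    tw.stop_txn = upd.txnid;

    return (tw.stop_txn);
}
```

Calling sketch_replay_race2(7, 10) returns SKETCH_TXN_ABORTED even though the visibility check accepted txnid 7, because the value checked and the value copied come from two separate reads.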
Why existing locking doesn't prevent the race
T102 holds M0 (its session spinlock); T100 holds M1 (a different session spinlock), so the two threads share no lock. The comment at bt_delete.c:299 ("The ref is locked, no need to pay attention to memory ordering here") assumes the ref lock is sufficient, but locking the ref only prevents other threads from acquiring new hazard pointers. T100 already holds a hazard pointer acquired during __wti_delete_page_instantiate and reads the tombstone fields with no lock that synchronizes against __wt_delete_page_rollback.
Root cause: unconditional upd_saved_txnid write
The write at bt_delete.c:302 is unconditional — it always writes upd_saved_txnid even for non-prepared transactions, where upd_saved_txnid has no consumer:
for (; *updp != NULL; ++updp) {
    if (F_ISSET(&txn->time_point, WT_TXN_TIME_POINT_HAS_TS_ROLLBACK))
        (*updp)->upd_rollback_ts = txn->time_point.rollback_timestamp;
    __wt_atomic_store_uint64_relaxed(&(*updp)->upd_saved_txnid, (*updp)->txnid); // always
    __wt_atomic_store_uint64_v_relaxed(&(*updp)->txnid, WT_TXN_ABORTED);
}
For a non-prepared fast-delete rollback, this silently clobbers upd_start_ts with a txnid value at exactly the moment a concurrent reader may be reading it as a timestamp.
Fix direction
- Race 1: Guard the upd_saved_txnid write (bt_delete.c:302) under the same WT_TXN_TIME_POINT_HAS_TS_ROLLBACK condition as upd_rollback_ts (lines 300-301). For non-prepared rollbacks, this write serves no purpose and clobbers upd_start_ts.
- Race 2: The txnid = WT_TXN_ABORTED write is necessary. The reader needs either a consistent acquire/release pair around the txnid read + subsequent timestamp reads, or a guarantee that it will not read upd_start_ts after observing WT_TXN_ABORTED (use __wt_tsan_suppress_load_uint64_v as done at txn_inline.h:1548).
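One possible shape for both directions, written with C11 atomics rather than WiredTiger's own atomic wrappers. This is a sketch under stated assumptions, not a proposed patch: the struct, the has_ts_rollback parameter (standing in for the WT_TXN_TIME_POINT_HAS_TS_ROLLBACK check), the SKETCH_TXN_ABORTED value, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for WT_TXN_ABORTED; the actual value is an assumption here. */
#define SKETCH_TXN_ABORTED UINT64_MAX

/* Hypothetical stand-in for the racing fields of WT_UPDATE. */
struct sketch_upd {
    _Atomic uint64_t txnid;
    union {
        struct {
            uint64_t durable_ts;
            uint64_t start_ts;
        } commit;
        struct {
            uint64_t rollback_ts;
            uint64_t saved_txnid;
        } prepare_rollback;
    } u;
};

/*
 * Race 1 direction: write the prepare_rollback members only when the
 * rollback-timestamp condition holds, so a non-prepared rollback leaves
 * the start_ts bytes untouched.
 */
static void
sketch_rollback(struct sketch_upd *upd, bool has_ts_rollback, uint64_t rollback_ts)
{
    if (has_ts_rollback) {
        upd->u.prepare_rollback.rollback_ts = rollback_ts;
        upd->u.prepare_rollback.saved_txnid =
          atomic_load_explicit(&upd->txnid, memory_order_relaxed);
    }
    /* Release store: publishes the writes above to an acquire reader that sees ABORTED. */
    atomic_store_explicit(&upd->txnid, SKETCH_TXN_ABORTED, memory_order_release);
}

/*
 * Race 2 direction: read txnid once with acquire semantics, reuse that
 * single value, and do not touch start_ts once ABORTED has been observed.
 * Returns false when the update was seen as aborted.
 */
static bool
sketch_set_stop(struct sketch_upd *upd, uint64_t *stop_txn, uint64_t *stop_ts)
{
    uint64_t txnid = atomic_load_explicit(&upd->txnid, memory_order_acquire);

    if (txnid == SKETCH_TXN_ABORTED)
        return (false);

    *stop_txn = txnid;
    *stop_ts = upd->u.commit.start_ts;
    return (true);
}
```

The sketch only removes the overlap for the non-prepared case. When has_ts_rollback is true the writer still stores into the bytes a reader may be loading as start_ts, so a reader that loaded txnid before the rollback would need additional synchronization.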
Is related to: WT-16373 failed: format-stress-test-tsan on ubuntu2004-tsan [wiredtiger @ 5858bb39] (Closed)