TSAN data race in __wt_delete_page_rollback writing to instantiated tombstone concurrently read by cursor


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Transactions
    • Storage Engines - Transactions
    • 233.549
    • SE Transactions - 2026-06-05

      Summary

      TSAN reports a data race between __wt_delete_page_rollback (rolling back a fast-delete transaction) and a concurrent cursor reader traversing the same instantiated tombstone WT_UPDATE structs. The report covers two races on the same 64-byte heap block, one at offset 0 and one at offset 16.

      Reproduction

      ./t -c ../../../test/format/CONFIG.stress -T bulk,txn,retain=50 runs.rows=100000:300000 runs.tables=1:3 runs.ops=300000
      

      Nondeterministic. Requires TSAN build of test/format.

      Relevant WT_UPDATE memory layout

      offset  0 : txnid           volatile uint64_t           <- Race 2
      offset  8 : union u (16 bytes)
                    commit.durable_ts   (upd_durable_ts)
                    commit.start_ts     (upd_start_ts)         <- Race 1 (offset 16)
                    prepare_rollback.rollback_ts (upd_rollback_ts)
                    prepare_rollback.saved_txnid (upd_saved_txnid)  <- Race 1 (same bytes)
      

      upd_start_ts and upd_saved_txnid are the same 8 bytes at offset 16 — overlapping union members (btmem.h:1494-1513).
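
      The overlap can be checked with a small stand-alone mock. This is only a sketch: struct mock_update and its helper are hypothetical names that mirror the layout listed above, not the real definition in btmem.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock of the first 24 bytes of WT_UPDATE as listed in this ticket. */
struct mock_update {
    volatile uint64_t txnid; /* offset 0 */
    union {
        struct {
            uint64_t durable_ts; /* offset 8 */
            uint64_t start_ts;   /* offset 16 */
        } commit;
        struct {
            uint64_t rollback_ts; /* offset 8 */
            uint64_t saved_txnid; /* offset 16 */
        } prepare_rollback;
    } u;
};

/* Compile-time checks that the two union members occupy the same bytes. */
static_assert(offsetof(struct mock_update, txnid) == 0, "txnid is at offset 0");
static_assert(offsetof(struct mock_update, u.commit.start_ts) == 16, "start_ts is at offset 16");
static_assert(offsetof(struct mock_update, u.prepare_rollback.saved_txnid) == 16,
  "saved_txnid is at offset 16");

/* Store through saved_txnid, then read the same bytes back through start_ts. */
static uint64_t
mock_start_ts_after_saved_txnid_store(uint64_t txnid)
{
    struct mock_update upd = {0};

    upd.u.prepare_rollback.saved_txnid = txnid;
    return (upd.u.commit.start_ts);
}
```

      In this mock, a store through saved_txnid is read back unchanged through start_ts, which is the aliasing that Race 1 depends on.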

      Race 1 — offset 16 (bt_delete.c:302 vs timestamp_inline.h:108)

      Writer T102 (__wt_delete_page_rollback, rolling back):

      // bt_delete.c:302 — unconditional, no prepare guard
      __wt_atomic_store_uint64_relaxed(&(*updp)->upd_saved_txnid, (*updp)->txnid);
      

      Reader T100 (WT_TIME_WINDOW_SET_STOP, non-prepare path):

      // timestamp_inline.h:108
      (tw)->stop_ts = (upd)->upd_start_ts;   // same 8 bytes as upd_saved_txnid
      

      T102 stores a transaction ID into the memory slot that T100 reads as a timestamp. If T102 writes first, T100 records a txnid value as a stop timestamp — semantically corrupt.

      Race 2 — offset 0 (bt_delete.c:303 vs timestamp_inline.h:99)

      Writer T102:

      // bt_delete.c:303
      __wt_atomic_store_uint64_v_relaxed(&(*updp)->txnid, WT_TXN_ABORTED);
      

      Reader T100:

      // timestamp_inline.h:99
      (tw)->stop_txn = (upd)->txnid;
      

      T100 passed the visibility check (read txnid, found it in-snapshot), then called __wt_upd_value_assign. Between that check and this second read, T102 wrote WT_TXN_ABORTED. T100 copies WT_TXN_ABORTED into stop_txn.
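
      The interleaving described above can be replayed deterministically in a single thread. This is an illustrative sketch only: every mock_ name is hypothetical, MOCK_TXN_ABORTED is a stand-in value, and the visibility rule is simplified to a comparison against an assumed snapshot maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_TXN_ABORTED UINT64_MAX /* stand-in for WT_TXN_ABORTED in this sketch */

struct mock_update {
    volatile uint64_t txnid;
};

struct mock_time_window {
    uint64_t stop_txn;
};

/* Simplified visibility check: the first read of txnid. */
static bool
mock_visible(const struct mock_update *upd, uint64_t snap_max)
{
    uint64_t id = upd->txnid;

    return (id != MOCK_TXN_ABORTED && id <= snap_max);
}

/* What the rolling-back thread does to txnid. */
static void
mock_rollback(struct mock_update *upd)
{
    upd->txnid = MOCK_TXN_ABORTED;
}

/* What the reader does when building the time window: the second read of txnid. */
static void
mock_set_stop(struct mock_time_window *tw, const struct mock_update *upd)
{
    tw->stop_txn = upd->txnid;
}

/* Replay the order: reader check, writer rollback, reader copy. Returns stop_txn. */
static uint64_t
mock_replay_race2(uint64_t txnid, uint64_t snap_max)
{
    struct mock_update upd = {txnid};
    struct mock_time_window tw = {0};

    if (mock_visible(&upd, snap_max)) {
        mock_rollback(&upd);      /* lands between the two reads */
        mock_set_stop(&tw, &upd); /* copies the aborted marker */
    }
    return (tw.stop_txn);
}
```

      With this ordering the update passes the check with its original txnid, yet the time window ends up holding the aborted marker.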

      Why existing locking doesn't prevent the race

      T102 holds M0 (its session spinlock); T100 holds M1 (a different session spinlock). The comment at bt_delete.c:299, "The ref is locked, no need to pay attention to memory ordering here", does not cover this case: locking the ref only prevents other threads from acquiring new hazard pointers. T100 already holds a hazard pointer acquired during __wti_delete_page_instantiate and reads the tombstone fields with no lock that synchronizes against __wt_delete_page_rollback.

      Root cause: unconditional upd_saved_txnid write

      The write at bt_delete.c:302 is unconditional — it always writes upd_saved_txnid even for non-prepared transactions, where upd_saved_txnid has no consumer:

      for (; *updp != NULL; ++updp) {
          if (F_ISSET(&txn->time_point, WT_TXN_TIME_POINT_HAS_TS_ROLLBACK))
              (*updp)->upd_rollback_ts = txn->time_point.rollback_timestamp;
          __wt_atomic_store_uint64_relaxed(&(*updp)->upd_saved_txnid, (*updp)->txnid); // always
          __wt_atomic_store_uint64_v_relaxed(&(*updp)->txnid, WT_TXN_ABORTED);
      }
      

      For a non-prepared fast-delete rollback, this silently clobbers upd_start_ts with a txnid value at exactly the moment a concurrent reader may be reading it as a timestamp.

      Fix direction

      • Race 1: Guard the upd_saved_txnid write (bt_delete.c:302) under the same WT_TXN_TIME_POINT_HAS_TS_ROLLBACK condition as upd_rollback_ts (lines 300-301). For non-prepared rollbacks, this write serves no purpose and clobbers upd_start_ts.
      • Race 2: The txnid = WT_TXN_ABORTED write is necessary. The reader needs either a consistent acquire/release pair around the txnid read and the subsequent timestamp reads, or a guarantee that it will not read upd_start_ts after observing WT_TXN_ABORTED (for example, using __wt_tsan_suppress_load_uint64_v as done at txn_inline.h:1548).
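
      A minimal sketch of the Race 1 direction, assuming the saved_txnid store simply moves under the existing rollback-timestamp condition. The types and names below are hypothetical mocks; the real loop uses the __wt_atomic_store_* helpers and the WT_TXN_TIME_POINT_HAS_TS_ROLLBACK flag, which are replaced here by plain stores and a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_TXN_ABORTED UINT64_MAX /* stand-in for WT_TXN_ABORTED in this sketch */

/* Hypothetical mock of the WT_UPDATE fields involved in the race. */
struct mock_update {
    volatile uint64_t txnid;
    union {
        struct {
            uint64_t durable_ts;
            uint64_t start_ts;
        } commit;
        struct {
            uint64_t rollback_ts;
            uint64_t saved_txnid;
        } prepare_rollback;
    } u;
};

/* Walk a NULL-terminated update list; only touch the union when has_ts_rollback is set. */
static void
mock_rollback_updates(struct mock_update **updp, bool has_ts_rollback, uint64_t rollback_ts)
{
    for (; *updp != NULL; ++updp) {
        if (has_ts_rollback) {
            (*updp)->u.prepare_rollback.rollback_ts = rollback_ts;
            (*updp)->u.prepare_rollback.saved_txnid = (*updp)->txnid;
        }
        (*updp)->txnid = MOCK_TXN_ABORTED; /* still unconditional */
    }
}

/* Helper: roll back one update whose start_ts is 100 and report start_ts afterwards. */
static uint64_t
mock_start_ts_after_rollback(bool has_ts_rollback)
{
    struct mock_update upd = {.txnid = 7};
    struct mock_update *list[] = {&upd, NULL};

    upd.u.commit.durable_ts = 100;
    upd.u.commit.start_ts = 100;
    mock_rollback_updates(list, has_ts_rollback, 200);
    return (upd.u.commit.start_ts);
}
```

      In this mock, a rollback without the flag leaves the start_ts bytes intact, while a rollback with the flag still overwrites them with the txnid. The sketch does not address Race 2, since the txnid store remains unconditional.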

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Wei Hu