Layered cursor writes on follower do not check stable cell's full time window


    • Storage Engines - Foundations
    • None
    • 5

      Summary

      When a layered cursor on the follower performs a write that depends on a pre-existing value (remove / update / insert-existence check), it only consults the session-visible state of the stable constituent. A committed stop_ts on the stable cell can be invisible at the session's read_ts but visible to the drain. This lets the session issue a write on a key that, from the drain's timestamp-independent view, has nothing to operate on — producing unresolvable state at drain time (e.g. the __layered_assert_tombstone_has_value_on_stable_btree assert seen in WT-17240).

      Root cause

      Layered cursor writes go to the ingest btree, but the "does this key have a live value?" decision is based on what the session can read at its read_ts from the stable btree. MVCC's normal write-write conflict detection is per-btree, so a newer committed stop on stable is not surfaced to an ingest write path.

      Concretely, in cur_layered.c __clayered_remove_follower:

      • When positioned=true and current_cursor == stable_cursor: no check — the tombstone is written unconditionally.
      • When positioned=false: __clayered_lookup → __clayered_lookup_constituent → stable_cursor->search() returns V because read_ts < stop_ts, even though the stable cell carries a committed stop.

      In both cases, the session writes a tombstone to ingest for a key whose stable cell already has a stop. On the next drain, __layered_assert_tombstone_has_value_on_stable_btree fires with has_value=false (stable cell has HAS_STOP=true) and the ingest tombstone is not globally visible.
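
The gap between the two views can be illustrated with a toy model. This is only a sketch: toy_time_window, TOY_TS_NONE, session_sees_value and drain_sees_value are hypothetical names, not WiredTiger code, and real visibility also depends on transaction IDs and durable timestamps, which are ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel: the cell has no stop timestamp. */
#define TOY_TS_NONE UINT64_MAX

/* Toy stand-in for a cell's time window (timestamps only). */
struct toy_time_window {
    uint64_t start_ts;
    uint64_t stop_ts;
};

/* Session view: the value is readable when read_ts lies in [start_ts, stop_ts). */
static bool
session_sees_value(const struct toy_time_window *tw, uint64_t read_ts)
{
    return (read_ts >= tw->start_ts && read_ts < tw->stop_ts);
}

/* Drain view: timestamp-independent, so any recorded stop means "no live value". */
static bool
drain_sees_value(const struct toy_time_window *tw)
{
    return (tw->stop_ts == TOY_TS_NONE);
}
```

With a window of {start_ts=10, stop_ts=50} and read_ts=30, session_sees_value returns true while drain_sees_value returns false, which is the disagreement that lets the tombstone reach ingest.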

      Reproducer

      test/format in disagg.mode=switch with ops.prepare=1 and preserve_prepared=1, enabled by WT-15795. See WT-17240 for a concrete stack trace and the aborted-prepared tombstone observed at conn_layered_ingest.c:309:

      • Leader period: K=V with committed stop_ts=S on stable.
      • Stepdown to follower.
      • Follower prepared txn reads K at read_ts=R < S → stable cell's stop is invisible at R → sees V → DELETE K → tombstone in ingest.
      • Rollback turns it into an aborted-prepared rollback marker.
      • First stepup + drain → assert fires.

      Insert and update have the same problem

      The issue is not specific to remove. Any layered-cursor write that consults "does K exist live on stable?" to decide what to write to ingest is subject to the same staleness:

      • __clayered_insert / insert-existence check (e.g. no-overwrite mode): session reads at R, sees V on stable (because stable's stop is invisible at R), concludes "key exists, insert would be a duplicate" — or conversely treats a key as absent when the stable cell carries an invisible-to-R stop that the drain will honor. Either decision diverges from the drain's view.
      • __clayered_update / modify-follower path (see __clayered_modify_follower at cur_layered.c:2572): reads the visible-at-R value from stable as the base for the modify / update, then writes the result to ingest. The timestamp-skew problem is the same: the stable base value may already have a committed stop that the drain respects.

      All three operations need the same guard: before writing the ingest update, verify the stable cell's full time window (WT_TIME_WINDOW_HAS_STOP) — the drain's view — not just the session-visible value.

      Fix

      For the remove path, the minimal guard is in __clayered_remove_follower: when the layered cursor is positioned (or lookup lands) on the stable constituent, check ((WT_CURSOR_BTREE *)clayered->stable_cursor)->upd_value->tw via WT_TIME_WINDOW_HAS_STOP and return WT_NOTFOUND if set. This mirrors the predicate used in __layered_assert_tombstone_has_value_on_stable_btree so the write path and drain assert agree on "nothing to delete."

      The same guard shape should be applied to the insert existence check and the update/modify-follower paths to prevent analogous drain-time inconsistencies.
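
The guard shape might look roughly like the following, written against mock types so that it compiles on its own. Everything prefixed toy_ or TOY_ is a hypothetical stand-in: real code would use WT_CURSOR_BTREE, WT_TIME_WINDOW_HAS_STOP and WT_NOTFOUND as described above, and where exactly the check sits inside __clayered_remove_follower is an assumption this ticket does not settle.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TS_NONE UINT64_MAX /* Hypothetical: no stop on the cell. */
#define TOY_NOTFOUND (-1)      /* Hypothetical stand-in for WT_NOTFOUND. */

struct toy_time_window {
    uint64_t start_ts;
    uint64_t stop_ts;
};

/* Mock btree cursor: remembers the time window of the value it last returned. */
struct toy_cbt {
    struct toy_time_window tw;
};

/* Mock layered cursor: two constituents plus the one currently positioned. */
struct toy_layered_cursor {
    struct toy_cbt ingest;
    struct toy_cbt stable;
    struct toy_cbt *current;
};

/* Stand-in for the WT_TIME_WINDOW_HAS_STOP predicate. */
static bool
toy_tw_has_stop(const struct toy_time_window *tw)
{
    return (tw->stop_ts != TOY_TS_NONE);
}

/*
 * Run before a dependent write is sent to ingest. Returns TOY_NOTFOUND when the
 * value came from stable and its cell already carries a stop, otherwise 0.
 */
static int
toy_write_guard(const struct toy_layered_cursor *c)
{
    if (c->current == &c->stable && toy_tw_has_stop(&c->stable.tw))
        return (TOY_NOTFOUND);
    return (0);
}
```

For remove, the result can be returned to the caller directly. How insert and update should translate a TOY_NOTFOUND result (for example, treating the key as absent for a no-overwrite insert) is left open here.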

      Related

      • WT-17240 — the drain assert that surfaces this bug for the remove path.
      • WT-15795 — enabled prepare ops for disagg test/format, which made the scenario reachable in CI.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Chenhao Qu