Summary
When a layered cursor on the follower performs a write that depends on a pre-existing value (remove / update / insert-existence check), it only consults the session-visible state of the stable constituent. A committed stop_ts on the stable cell can be invisible at the session's read_ts but visible to the drain. This lets the session issue a write on a key that, from the drain's timestamp-independent view, has nothing to operate on — producing unresolvable state at drain time (e.g. the __layered_assert_tombstone_has_value_on_stable_btree assert seen in WT-17240).
Root cause
Layered cursor writes go to the ingest btree, but the "does this key have a live value?" decision is based on what the session can read at its read_ts from the stable btree. MVCC's normal write-write conflict detection is per-btree, so a newer committed stop on stable is not surfaced to an ingest write path.
Concretely, in cur_layered.c __clayered_remove_follower:
- When positioned=true and current_cursor == stable_cursor: no check — the tombstone is written unconditionally.
- When positioned=false: __clayered_lookup → __clayered_lookup_constituent → stable_cursor->search() returns V because read_ts < stop_ts, even though the stable cell carries a committed stop.
In both cases, the session writes a tombstone to ingest for a key whose stable cell already has a stop. On the next drain, __layered_assert_tombstone_has_value_on_stable_btree fires because has_value=false (the stable cell has HAS_STOP=true) and the ingest tombstone is not globally visible.
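The mismatch between the two views can be illustrated with a small standalone model. This is a sketch only: struct model_tw, MODEL_TS_NONE and the model_* functions are hypothetical names invented for illustration, not WiredTiger code, and the model ignores transaction IDs and prepared state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TS_NONE UINT64_MAX /* hypothetical marker: no committed stop */

/* Hypothetical, simplified model of a stable cell's time window. */
struct model_tw {
    uint64_t start_ts; /* commit timestamp of the value */
    uint64_t stop_ts;  /* commit timestamp of the remove, or MODEL_TS_NONE */
};

/* Session view: live if the value started at or before read_ts and the stop is later. */
static bool
model_session_sees_value(const struct model_tw *tw, uint64_t read_ts)
{
    return (tw->start_ts <= read_ts && read_ts < tw->stop_ts);
}

/* Drain view (timestamp-independent): any committed stop means no live value. */
static bool
model_drain_sees_value(const struct model_tw *tw)
{
    return (tw->stop_ts == MODEL_TS_NONE);
}

/* True when a write decided at read_ts would contradict the drain's view. */
static bool
model_views_diverge(const struct model_tw *tw, uint64_t read_ts)
{
    return (model_session_sees_value(tw, read_ts) != model_drain_sees_value(tw));
}

/* Example: K=V with stop S=50 on stable, follower reads at R=30 < S. */
static void
model_example(void)
{
    struct model_tw k = {.start_ts = 10, .stop_ts = 50};

    assert(model_session_sees_value(&k, 30)); /* the stop is invisible at R */
    assert(!model_drain_sees_value(&k));      /* the drain honors the stop */
    assert(model_views_diverge(&k, 30));      /* the write decision is stale */
}
```

Calling model_example() from a test harness exercises the case described above: the session concludes the key is live while the drain concludes it is not.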
Reproducer
test/format in disagg.mode=switch with ops.prepare=1 and preserve_prepared=1, enabled by WT-15795. See WT-17240 for a concrete stack trace and the aborted-prepared tombstone observed at conn_layered_ingest.c:309. The failing sequence:
- Leader period: K=V with committed stop_ts=S on stable.
- Stepdown to follower.
- Follower prepared txn reads K at read_ts=R < S → stable cell's stop is invisible at R → sees V → DELETE K → tombstone in ingest.
- Rollback turns it into an aborted-prepared rollback marker.
- First stepup + drain → assert fires.
Insert and update have the same problem
The issue is not specific to remove. Any layered-cursor write that consults "does K exist live on stable?" to decide what to write to ingest is subject to the same staleness:
- __clayered_insert / insert-existence check (e.g. no-overwrite mode): session reads at R, sees V on stable (because stable's stop is invisible at R), concludes "key exists, insert would be a duplicate" — or conversely treats a key as absent when the stable cell carries an invisible-to-R stop that the drain will honor. Either decision diverges from the drain's view.
- __clayered_update / modify-follower path (see __clayered_modify_follower at cur_layered.c:2572): reads the visible-at-R value from stable as the base for the modify / update, then writes the result to ingest. Same timestamp-skew problem: the stable base value may already have a committed stop that the drain respects.
All three operations need the same guard: before writing the ingest update, verify the stable cell's full time window (WT_TIME_WINDOW_HAS_STOP) — the drain's view — not just the session-visible value.
Fix
For the remove path, the minimal guard is in __clayered_remove_follower: when the layered cursor is positioned (or lookup lands) on the stable constituent, check ((WT_CURSOR_BTREE *)clayered->stable_cursor)->upd_value->tw via WT_TIME_WINDOW_HAS_STOP and return WT_NOTFOUND if set. This mirrors the predicate used in __layered_assert_tombstone_has_value_on_stable_btree so the write path and drain assert agree on "nothing to delete."
The same guard shape should be applied to the insert existence check and the update/modify-follower paths to prevent analogous drain-time inconsistencies.
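The guard shape can be sketched in isolation as follows. Everything here is a stand-in under stated assumptions: struct sketch_tw, sketch_tw_has_stop and sketch_follower_write_guard are hypothetical names, SKETCH_NOTFOUND is a placeholder rather than the real WT_NOTFOUND value, and sketch_tw_has_stop assumes the real WT_TIME_WINDOW_HAS_STOP tests both the stop timestamp and the stop transaction ID. It does not touch the actual cursor structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TS_MAX UINT64_MAX  /* stand-in: stop timestamp not set */
#define SKETCH_TXN_MAX UINT64_MAX /* stand-in: stop transaction not set */
#define SKETCH_NOTFOUND (-1)      /* placeholder, not the real WT_NOTFOUND value */

/* Hypothetical, reduced time window: only the stop half matters for the guard. */
struct sketch_tw {
    uint64_t stop_ts;
    uint64_t stop_txn;
};

/* Assumed shape of the stop predicate: either stop field is set. */
static bool
sketch_tw_has_stop(const struct sketch_tw *tw)
{
    return (tw->stop_txn != SKETCH_TXN_MAX || tw->stop_ts != SKETCH_TS_MAX);
}

/* Which constituent the lookup or current position landed on. */
enum sketch_source { SKETCH_FROM_INGEST, SKETCH_FROM_STABLE };

/*
 * Decide whether a follower write may treat the key as live. When the value
 * came from stable and the cell carries a stop, report "not found" even if
 * the session's read_ts still sees the value.
 */
static int
sketch_follower_write_guard(enum sketch_source src, const struct sketch_tw *stable_tw)
{
    if (src == SKETCH_FROM_STABLE && sketch_tw_has_stop(stable_tw))
        return (SKETCH_NOTFOUND);
    return (0); /* the key is live from the drain's point of view */
}

/* Example: a stable cell with a committed stop blocks the tombstone write. */
static void
sketch_example(void)
{
    struct sketch_tw stopped = {.stop_ts = 50, .stop_txn = SKETCH_TXN_MAX};
    struct sketch_tw live = {.stop_ts = SKETCH_TS_MAX, .stop_txn = SKETCH_TXN_MAX};

    assert(sketch_follower_write_guard(SKETCH_FROM_STABLE, &stopped) == SKETCH_NOTFOUND);
    assert(sketch_follower_write_guard(SKETCH_FROM_STABLE, &live) == 0);
}
```

How each caller maps the result is operation-specific: remove and update would presumably surface the "not found" result, while a no-overwrite insert would treat it as "key absent, proceed". That mapping is an assumption here, not something the ticket specifies.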
Related
- is related to WT-15795 Enable the preserve prepared tests in test format for disagg runs (Closed)