Type: Bug
Resolution: Unresolved
Priority: Critical - P2
Affects Version/s: None
Component/s: Cursors
Storage Engines - Foundations
Summary
During a collection scan on a layered cursor, if the stable cursor is positioned on a key X that has not yet been returned to the caller (because the ingest cursor returned a smaller key), and the application commits the current transaction and starts a new one, the scan may subsequently return a value for X that was evaluated under the old transaction's visibility (wrong read_timestamp or stale snapshot).
Background
The layered cursor merges two constituent cursors: an ingest cursor (local follower writes) and a stable cursor (checkpointed data). During next(), both cursors are advanced and __clayered_get_current() picks the one with the smaller key. The non-current cursor is left positioned at a key it has not yet returned.
Within a checkpoint, the btree cursor still performs per-key visibility checks using the session's current transaction context (read_timestamp / snapshot). The value stored in the cursor's buffer at the time next() was called reflects the visibility rules of that transaction.
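The selection step described above can be illustrated with a minimal sketch. This is not the WiredTiger implementation: toy_cursor and toy_get_current are hypothetical names, keys are plain C strings, and the tie-break in favor of the ingest cursor is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a constituent cursor; not the WiredTiger structure. */
typedef struct {
    const char *key; /* Key the cursor is positioned on, NULL once exhausted. */
} toy_cursor;

/*
 * Return the constituent positioned on the smaller key, or NULL when both
 * are exhausted. Ties go to the ingest cursor (an assumption for this sketch).
 * The constituent that is not returned stays positioned on a key the caller
 * has not seen yet.
 */
static const toy_cursor *
toy_get_current(const toy_cursor *ingest, const toy_cursor *stable)
{
    assert(ingest != NULL && stable != NULL);

    if (ingest->key == NULL)
        return (stable->key == NULL ? NULL : stable);
    if (stable->key == NULL)
        return (ingest);
    return (strcmp(ingest->key, stable->key) <= 0 ? ingest : stable);
}
```

With ingest on "b" and stable on "c", the sketch selects the ingest cursor and leaves the stable cursor parked on "c", which is the state the bug below depends on.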
Bug
Scenario:
- Layered cursor scan is in progress. Stable cursor is positioned at key X (not yet returned); ingest cursor returned key Y where Y < X. WT_CLAYERED_ITERATE_NEXT is set.
- Application commits transaction T1 and starts transaction T2 (different read_timestamp or new snapshot).
- Application calls cursor.next().
- In __clayered_should_advance_stable() (src/cursor/cur_layered.c):
  - The stable cursor is not the current_cursor, so the guard at line 368–370 does not block.
  - Non-timestamp path (line 383): if (iteration) return (false), so the stable cursor is not reopened. X's value in the cursor buffer was evaluated under T1's snapshot and stays cached unchanged under T2.
- The ingest cursor advances. Eventually X becomes the smaller key and __clayered_get_current() selects the stable cursor. The value returned for X was evaluated under T1's visibility, which may be incorrect under T2.
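The scenario can be reproduced in miniature with a toy model. Everything here is hypothetical and only illustrates the caching behavior: toy_version, toy_stable_cursor and the functions below are not WiredTiger APIs, and a single integer "read point" stands in for either a read_timestamp or a snapshot.

```c
#include <assert.h>
#include <stddef.h>

/* One committed version of key X (hypothetical model, not a WiredTiger type). */
typedef struct {
    unsigned ts;       /* Read point at which this version becomes visible. */
    const char *value;
} toy_version;

/* Newest version visible at read_point; versions are sorted by ascending ts. */
static const char *
toy_visible(const toy_version *versions, size_t n, unsigned read_point)
{
    const char *result = NULL;

    for (size_t i = 0; i < n; i++)
        if (versions[i].ts <= read_point)
            result = versions[i].value;
    return (result);
}

/* Toy stable cursor that caches the value it saw when it was positioned. */
typedef struct {
    const toy_version *versions;
    size_t nversions;
    const char *cached_value;
} toy_stable_cursor;

/* Position on X under the read point of the transaction active at that time. */
static void
toy_position(toy_stable_cursor *cursor, unsigned read_point)
{
    assert(cursor != NULL);
    cursor->cached_value = toy_visible(cursor->versions, cursor->nversions, read_point);
}

/* Models the problematic path: hand back the cached value without re-checking. */
static const char *
toy_return_cached(const toy_stable_cursor *cursor)
{
    return (cursor->cached_value);
}
```

If X has versions at read points 10 and 20, positioning under T1 (read point 10) caches the older version. After switching to T2 (read point 20), toy_return_cached still yields the older version, while toy_visible with read point 20 yields the newer one.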
Affected cases
Non-timestamp case (snapshot isolation)
__clayered_should_advance_stable returns false mid-iteration (line 383) regardless of any snapshot change. The stable cursor's buffered value at X reflects T1's snapshot. If T2's snapshot changes which version of X is visible (e.g. because the checkpoint contains multiple versions), the wrong value is returned.
Timestamp case — current behavior vs. future optimization
Currently, when a read_timestamp is set, __clayered_should_advance_stable returns true unconditionally (line 379), causing the stable cursor to be reopened on the latest checkpoint on *every* next() call. __clayered_advance_stable then duplicates the old cursor's position to the new cursor via __wt_cursor_dup_position, which triggers a fresh read under the new checkpoint and new read_timestamp. This incidentally masks the visibility issue.
However, if this is optimized to only reopen when a new checkpoint is available (which is the natural performance improvement to pursue), the timestamp case would have the same bug: the stable cursor would remain positioned at X with its value evaluated under T1's read_timestamp. If T2 has a different read_timestamp, the value returned for X may correspond to a different historical version than what T2 should see.
Root cause
__clayered_iterate_constituents does not re-evaluate the visibility of a key that the stable cursor is already positioned on but has not yet returned. The value is computed once during next() on the stable constituent and cached; it is not recomputed when the surrounding transaction changes.
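To make the root cause concrete, the sketch below shows what "recompute when the surrounding transaction changes" could look like in a toy model. It is not a proposed patch: the types, the cached_read_point field and toy_current_value are all hypothetical, and a single integer read point stands in for read_timestamp or snapshot identity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One committed version of key X (hypothetical model, not a WiredTiger type). */
typedef struct {
    unsigned ts;       /* Read point at which this version becomes visible. */
    const char *value;
} toy_version;

/* Toy stable cursor that remembers which read point its cached value belongs to. */
typedef struct {
    const toy_version *versions;
    size_t nversions;
    const char *cached_value;
    unsigned cached_read_point;
    bool positioned;
} toy_stable_cursor;

/* Newest version visible at read_point; versions are sorted by ascending ts. */
static const char *
toy_visible(const toy_version *versions, size_t n, unsigned read_point)
{
    const char *result = NULL;

    for (size_t i = 0; i < n; i++)
        if (versions[i].ts <= read_point)
            result = versions[i].value;
    return (result);
}

/*
 * Return the value for the pending key, re-evaluating visibility whenever the
 * caller's read point differs from the one the cached value was computed under.
 */
static const char *
toy_current_value(toy_stable_cursor *cursor, unsigned read_point)
{
    assert(cursor != NULL);

    if (!cursor->positioned || cursor->cached_read_point != read_point) {
        cursor->cached_value = toy_visible(cursor->versions, cursor->nversions, read_point);
        cursor->cached_read_point = read_point;
        cursor->positioned = true;
    }
    return (cursor->cached_value);
}
```

The point of the sketch is only that the cached value is tied to the read point it was computed under. Whether a real fix would compare read timestamps, snapshot identity, or something else is left open by this ticket.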
- is related to: WT-17030 Avoid reopening the stable table for each operation on follower (Open)