Layered cursor may return wrong value when transaction changes mid-scan

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical - P2
    • Affects Version/s: None
    • Component/s: Cursors
    • Storage Engines - Foundations

      Summary

      During a collection scan on a layered cursor, the stable cursor can be positioned on a key X that has not yet been returned to the caller because the ingest cursor returned a smaller key. If the application then commits the current transaction and starts a new one, the scan may later return a value for X that was evaluated under the old transaction's visibility (wrong read_timestamp or stale snapshot).

      Background

      The layered cursor merges two constituent cursors: an ingest cursor (local follower writes) and a stable cursor (checkpointed data). During next(), both cursors are advanced and __clayered_get_current() picks the one with the smaller key. The non-current cursor is left positioned at a key it has not yet returned.
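
      The merge can be illustrated with a toy model. The sketch below is plain Python, not WiredTiger code: ToyLayeredCursor and its fields are hypothetical names, keys are assumed to be disjoint between the two sources, and it assumes that after the first call only the cursor that was last returned is advanced.

```python
class ToyLayeredCursor:
    """Toy two-way merge over sorted (key, value) lists; not the real layered cursor."""

    def __init__(self, ingest, stable):
        self.sources = {"ingest": sorted(ingest), "stable": sorted(stable)}
        self.pos = {"ingest": -1, "stable": -1}  # -1 means "not positioned yet"
        self.current = None  # name of the source that supplied the last record

    def _key(self, name):
        items, i = self.sources[name], self.pos[name]
        return items[i][0] if 0 <= i < len(items) else None

    def parked_key(self):
        """Key the non-current source is positioned on but has not returned yet."""
        other = "stable" if self.current == "ingest" else "ingest"
        return self._key(other)

    def next(self):
        # First call: advance both sources. Later calls: advance only the
        # source whose record was returned last (assumption of this model).
        for name in self.sources:
            if self.current in (None, name):
                self.pos[name] += 1
        live = [n for n in self.sources if self._key(n) is not None]
        if not live:
            return None  # both sources exhausted
        # Pick the source with the smaller key, as the ticket describes.
        self.current = min(live, key=self._key)
        return self.sources[self.current][self.pos[self.current]]
```

      With ingest keys 1 and 2 and a stable key 3, the first next() returns the ingest record for key 1 while the stable source stays parked on key 3, which is the state the rest of this ticket is concerned with.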

      Within a checkpoint, the btree cursor still performs per-key visibility checks using the session's current transaction context (read_timestamp / snapshot). The value stored in the cursor's buffer at the time next() was called reflects the visibility rules of that transaction.
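
      To make the dependence on the transaction context concrete, here is a deliberately simplified visibility rule in Python. It assumes each key carries a list of (commit timestamp, value) versions and that a reader sees the newest version at or below its read_timestamp. Real WiredTiger visibility involves more than this; visible_value and the sample data are hypothetical.

```python
def visible_value(versions, read_timestamp):
    """Return the newest value with commit timestamp <= read_timestamp, or None.

    Toy rule only: it ignores snapshots, prepared updates and other details.
    """
    best = None
    for commit_ts, value in sorted(versions, key=lambda v: v[0]):
        if commit_ts <= read_timestamp:
            best = value
    return best


# Two versions of the same key, committed at timestamps 10 and 20.
versions_of_x = [(10, "v10"), (20, "v20")]

# Value buffered when next() ran under the first transaction.
buffered = visible_value(versions_of_x, read_timestamp=10)

# Value a later transaction with a newer read timestamp would expect.
fresh = visible_value(versions_of_x, read_timestamp=20)
```

      The buffered value and the fresh value differ here, so a buffer that outlives the transaction that filled it can hold a version the new transaction should not see.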

      Bug

      Scenario:

      1. Layered cursor scan is in progress. Stable cursor is positioned at key X (not yet returned); ingest cursor returned key Y where Y < X. WT_CLAYERED_ITERATE_NEXT is set.
      2. Application commits transaction T1 and starts transaction T2 (different read_timestamp or new snapshot).
      3. Application calls cursor.next().
      4. In __clayered_should_advance_stable() (src/cursor/cur_layered.c):
        • The stable cursor is not the current_cursor, so the guard at lines 368–370 does not block.
        • Non-timestamp path (line 383): if (iteration) return (false), so the stable cursor is not reopened. X's value in the cursor buffer was evaluated under T1's snapshot and is kept as-is under T2.
      5. The ingest cursor advances. Eventually X becomes the smaller key and __clayered_get_current() selects the stable cursor. The value returned for X was evaluated under T1's visibility, which may be incorrect under T2.
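
      The scenario above can be replayed in a small Python model. This is a sketch under stated assumptions, not a WiredTiger reproducer: ToySession, ToyStableCursor and run_scenario are hypothetical, and a transaction's snapshot is reduced to a single integer where a version is visible if its commit id is at or below it.

```python
def visible(versions, snapshot):
    """Newest value whose commit id is <= snapshot, or None (toy visibility rule)."""
    seen = [(cid, val) for cid, val in versions if cid <= snapshot]
    return max(seen)[1] if seen else None


class ToySession:
    def __init__(self, snapshot):
        self.snapshot = snapshot  # stands in for the current transaction's view


class ToyStableCursor:
    """Positions on a key and buffers the value visible at that moment."""

    def __init__(self, session, data):
        self.session, self.data = session, data
        self.keys = sorted(data)
        self.i = -1
        self.key = self.value = None

    def next(self):
        self.i += 1
        if self.i >= len(self.keys):
            self.key = self.value = None
            return False
        self.key = self.keys[self.i]
        # Visibility is evaluated once, here, under the session's current snapshot.
        self.value = visible(self.data[self.key], self.session.snapshot)
        return True


def run_scenario():
    session = ToySession(snapshot=10)  # transaction T1
    stable = ToyStableCursor(session, {"x": [(10, "old"), (20, "new")]})
    ingest = iter([("a", "ingest-a")])

    stable.next()         # step 1: stable parks on "x", value buffered under T1
    first = next(ingest)  # "a" < "x", so the ingest record is returned first
    session.snapshot = 20  # step 2: commit T1, begin T2 with a newer snapshot

    # Steps 3-5: the stable cursor is not reopened; once ingest is exhausted
    # its buffered record is returned unchanged.
    returned = (stable.key, stable.value)
    expected = visible(stable.data["x"], session.snapshot)
    return first, returned, expected
```

      In this model run_scenario() returns the buffered value for "x" alongside the value T2 should have seen, and the two disagree.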

      Affected cases

      Non-timestamp case (snapshot isolation)

      __clayered_should_advance_stable() returns false mid-iteration (line 383) regardless of any snapshot change. The stable cursor's buffered value at X reflects T1's snapshot. If T2's snapshot changes which version of X is visible (e.g. because the checkpoint contains multiple versions), the wrong value is returned.

      Timestamp case — current behavior vs. future optimization

      Currently, when a read_timestamp is set, __clayered_should_advance_stable() returns true unconditionally (line 379), causing the stable cursor to be reopened on the latest checkpoint on every next() call. __clayered_advance_stable() then duplicates the old cursor's position to the new cursor via __wt_cursor_dup_position(), which triggers a fresh read under the new checkpoint and new read_timestamp. This incidentally fixes the visibility issue.

      However, if this is optimized to only reopen when a new checkpoint is available (which is the natural performance improvement to pursue), the timestamp case would have the same bug: the stable cursor would remain positioned at X with its value evaluated under T1's read_timestamp. If T2 has a different read_timestamp, the value returned for X may correspond to a different historical version than what T2 should see.
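
      The decision logic, as this ticket describes it, can be paraphrased in Python to compare the current behavior with the possible optimization. This is a paraphrase, not the C source: both function names and all parameters are hypothetical, the guard is assumed to return false when the stable cursor is current, and the final fall-through of the first function is an assumption because the ticket does not describe it.

```python
def should_advance_stable(stable_is_current, has_read_timestamp, iterating):
    """Paraphrase of the current decision as described in this ticket."""
    if stable_is_current:
        return False  # guard (assumed to block reopening)
    if has_read_timestamp:
        return True   # timestamp path: reopen on every next()
    if iterating:
        return False  # non-timestamp path: never reopen mid-scan
    return True       # ASSUMPTION: behavior outside a scan is not described


def should_advance_stable_optimized(stable_is_current, new_checkpoint_available):
    """Hypothetical optimization: reopen only when a newer checkpoint exists."""
    if stable_is_current:
        return False
    return new_checkpoint_available


# Mid-scan, stable cursor parked, transaction just changed, no new checkpoint:
today_with_ts = should_advance_stable(False, has_read_timestamp=True, iterating=True)
today_without_ts = should_advance_stable(False, has_read_timestamp=False, iterating=True)
optimized = should_advance_stable_optimized(False, new_checkpoint_available=False)
```

      Neither function takes the transaction change into account. Only the current timestamp path reopens the stable cursor in this situation, which is why the optimized variant would expose the same stale read.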

      Root cause

      __clayered_iterate_constituents does not re-evaluate the visibility of a key that the stable cursor is already positioned on but has not yet returned. The value is computed once during next() on the stable constituent and cached; it is not recomputed when the surrounding transaction changes.
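
      The difference between "computed once and cached" and "recomputed when the transaction changes" can be sketched as follows. This is an illustration only, not a proposed patch: ParkedRead, txn_id and the integer snapshot are hypothetical, and visibility is reduced to "commit id at or below the snapshot".

```python
def visible(versions, snapshot):
    """Newest value whose commit id is <= snapshot, or None (toy visibility rule)."""
    seen = [(cid, val) for cid, val in versions if cid <= snapshot]
    return max(seen)[1] if seen else None


class ParkedRead:
    """A buffered value plus the id of the transaction that evaluated it."""

    def __init__(self, versions, txn_id, snapshot):
        self.versions = versions
        self.txn_id = txn_id
        self.value = visible(versions, snapshot)  # computed once, then cached

    def value_for(self, txn_id, snapshot):
        # Hypothetical check: if the surrounding transaction has changed since
        # the value was buffered, re-evaluate visibility before returning it.
        if txn_id != self.txn_id:
            self.txn_id = txn_id
            self.value = visible(self.versions, snapshot)
        return self.value
```

      For example, a ParkedRead built under transaction 1 with snapshot 10 keeps returning the old version for that transaction, but value_for(2, 20) re-evaluates and returns the version visible to the new transaction. Whether such a check belongs in the layered cursor or in the constituent cursor is left open here.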

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Chenhao Qu