Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0, 9.0.0-rc0
Affects Version/s: None
Component/s: Cursors
Labels:
- Disag_Storage
- dc
- expedite
- na-mdb

Assigned Teams: Storage Engines - Foundations
Total Hours with Assigned Team: 1,724.145
Sprint: None
Story Points: None

Summary

During a collection scan on a layered cursor, the stable cursor can be positioned on a key X that has not yet been returned to the caller because the ingest cursor returned a smaller key. If the application then commits the current transaction and starts a new one, the scan may later return a value for X that was evaluated under the old transaction's visibility (its read_timestamp or snapshot) rather than the new transaction's.

Background

The layered cursor merges two constituent cursors: an ingest cursor (local follower writes) and a stable cursor (checkpointed data). During next(), both cursors are advanced and __clayered_get_current() picks the one with the smaller key. The non-current cursor is left positioned at a key it has not yet returned.
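
The selection step can be sketched as a two-way merge over the positioned constituents. This is a simplified model, not WiredTiger code: `model_cursor` and `model_get_current` are hypothetical names, keys are plain C strings, and preferring the ingest cursor on equal keys is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a positioned constituent cursor. */
typedef struct {
    const char *key; /* NULL when the cursor is exhausted */
} model_cursor;

/*
 * Return the constituent holding the smaller key, or NULL if both are
 * exhausted. Ties go to the ingest cursor (an assumption of this model).
 */
static const model_cursor *
model_get_current(const model_cursor *ingest, const model_cursor *stable)
{
    assert(ingest != NULL && stable != NULL);

    if (ingest->key == NULL)
        return (stable->key == NULL ? NULL : stable);
    if (stable->key == NULL)
        return (ingest);
    return (strcmp(ingest->key, stable->key) <= 0 ? ingest : stable);
}
```

Whichever cursor is not selected stays positioned on its key, which is the state the bug depends on.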

Within a checkpoint, the btree cursor still performs per-key visibility checks using the session's current transaction context (read_timestamp / snapshot). The value stored in the cursor's buffer at the time next() was called reflects the visibility rules of that transaction.
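
To illustrate why the buffered value depends on the transaction, here is a minimal sketch of a timestamp-based visibility check. The types and the function `model_visible_value` are hypothetical, and the rule "newest version with commit timestamp at or below the read timestamp" is a simplification of the real visibility logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version of a single key stored in a checkpoint. */
typedef struct {
    uint64_t commit_ts; /* timestamp at which this version was committed */
    const char *value;
} model_version;

/*
 * Return the value of the newest version whose commit timestamp is not
 * later than read_ts, or NULL if no version is visible.
 */
static const char *
model_visible_value(const model_version *versions, size_t count, uint64_t read_ts)
{
    const model_version *best = NULL;
    size_t i;

    assert(versions != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (versions[i].commit_ts > read_ts)
            continue; /* committed after the reader's timestamp */
        if (best == NULL || versions[i].commit_ts > best->commit_ts)
            best = &versions[i];
    }
    return (best == NULL ? NULL : best->value);
}
```

Calling the function with two different read timestamps on the same version list can yield two different values, so a value computed once and cached is only valid for the transaction that computed it.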

Bug

Scenario:

1. Layered cursor scan is in progress. The stable cursor is positioned at key X (not yet returned); the ingest cursor returned key Y where Y < X. WT_CLAYERED_ITERATE_NEXT is set.
2. The application commits transaction T1 and starts transaction T2 (different read_timestamp or new snapshot).
3. The application calls cursor.next().
4. In __clayered_should_advance_stable() (src/cursor/cur_layered.c):
   - The stable cursor is not the current_cursor, so the guard at lines 368–370 does not block.
   - Non-timestamp path (line 383): if (iteration) return (false), so the stable cursor is not reopened. X's value in the cursor buffer was evaluated under T1's snapshot and is returned as-is under T2.
5. The ingest cursor advances. Eventually X becomes the smaller key and __clayered_get_current() selects the stable cursor. The value returned for X was evaluated under T1's visibility, which may be incorrect under T2.
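
The scenario can be reproduced in a toy model. Everything here is hypothetical and deliberately simplified: a transaction sees versions written by transaction IDs below its `snap_max`, and the stable cursor caches the value once when it is positioned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transaction: sees writers with IDs below snap_max. */
typedef struct {
    uint64_t snap_max;
} model_txn;

/* Hypothetical version of key X inside the checkpoint. */
typedef struct {
    uint64_t writer_id;
    const char *value;
} model_version;

/* Hypothetical stable cursor: the value is cached at positioning time. */
typedef struct {
    const model_version *versions;
    size_t count;
    const char *cached_value;
} model_stable_cursor;

/* Newest version visible to txn, or NULL if none is visible. */
static const char *
model_read(const model_version *versions, size_t count, const model_txn *txn)
{
    const model_version *best = NULL;
    size_t i;

    assert(txn != NULL);
    for (i = 0; i < count; i++)
        if (versions[i].writer_id < txn->snap_max &&
          (best == NULL || versions[i].writer_id > best->writer_id))
            best = &versions[i];
    return (best == NULL ? NULL : best->value);
}

/* Step 1: next() positions the stable cursor on X and caches the value. */
static void
model_position(model_stable_cursor *cursor, const model_txn *txn)
{
    cursor->cached_value = model_read(cursor->versions, cursor->count, txn);
}

/* Step 5: without a reopen, the scan hands back whatever was cached. */
static const char *
model_return_without_reopen(const model_stable_cursor *cursor)
{
    return (cursor->cached_value);
}
```

If the cursor is positioned under T1 and `model_return_without_reopen` is called after T2 has begun, the result can differ from `model_read` under T2, which is the mismatch this ticket describes.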

Affected cases

Non-timestamp case (snapshot isolation)

__clayered_should_advance_stable returns false mid-iteration (line 383) regardless of snapshot change. The stable cursor's buffered value at X reflects T1's snapshot. If T2's snapshot changes what version of X is visible (e.g. because the checkpoint contains multiple versions), the wrong value is returned.

Timestamp case — current behavior vs. future optimization

Currently, when a read_timestamp is set, __clayered_should_advance_stable returns true unconditionally (line 379), causing the stable cursor to be reopened on the latest checkpoint on every next() call. __clayered_advance_stable then duplicates the old cursor's position to the new cursor via __wt_cursor_dup_position, which triggers a fresh read under the new checkpoint and new read_timestamp. This incidentally avoids the visibility issue.
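
The branches of the decision described in this ticket can be summarized in a small model. This is a sketch based only on the description above, not the actual source: the flag names and `model_should_advance_stable` are hypothetical, the "stable cursor is current" outcome is inferred from the wording of the guard, and any path the ticket does not describe is reported as such rather than guessed.

```c
#include <assert.h>

/* Hypothetical flags describing the state at the time of the call. */
#define MODEL_STABLE_IS_CURRENT 0x1u
#define MODEL_HAS_READ_TIMESTAMP 0x2u
#define MODEL_ITERATING 0x4u
#define MODEL_ALL_FLAGS 0x7u

typedef enum {
    MODEL_KEEP_STABLE,   /* "return false": keep the existing stable cursor */
    MODEL_REOPEN_STABLE, /* "return true": reopen on the latest checkpoint */
    MODEL_NOT_DESCRIBED  /* outcome not covered by the ticket text */
} model_decision;

static model_decision
model_should_advance_stable(unsigned int flags)
{
    assert((flags & ~MODEL_ALL_FLAGS) == 0);

    /* Guard: a stable cursor that is the current cursor is left alone. */
    if (flags & MODEL_STABLE_IS_CURRENT)
        return (MODEL_KEEP_STABLE);
    /* Timestamp path: reopen unconditionally on every call. */
    if (flags & MODEL_HAS_READ_TIMESTAMP)
        return (MODEL_REOPEN_STABLE);
    /* Non-timestamp path: never reopen in the middle of an iteration. */
    if (flags & MODEL_ITERATING)
        return (MODEL_KEEP_STABLE);
    return (MODEL_NOT_DESCRIBED);
}
```

In this model only the timestamp path refreshes the stable cursor mid-scan, and none of the branches looks at whether the transaction changed since the cursor was positioned.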

However, if this is optimized to only reopen when a new checkpoint is available (which is the natural performance improvement to pursue), the timestamp case would have the same bug: the stable cursor would remain positioned at X with its value evaluated under T1's read_timestamp. If T2 has a different read_timestamp, the value returned for X may correspond to a different historical version than what T2 should see.

Root cause

__clayered_iterate_constituents does not re-evaluate the visibility of a key that the stable cursor is already positioned on but has not yet returned. The value is computed once during next() on the stable constituent and cached; it is not recomputed when the surrounding transaction changes.


Attachment: wt_17031_root_cause_and_fix.md (6 kB, Apr 23 2026 07:46:01 AM UTC)

Is related to: WT-17030 Avoid reopening the stable table for each operation on follower (Closed)

Assignee: Alexander Pullen
Reporter: Chenhao Qu
Votes: 0
Watchers: 7

Created: Mar 26 2026 09:36:19 PM UTC
Updated: May 13 2026 03:40:21 AM UTC
Resolved: May 06 2026 08:31:40 AM UTC
