Type: Bug
Resolution: Cannot Reproduce
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Cursors
Labels: None
Assigned Teams: Storage Engines
Total Hours with Assigned Team: 565.519
Sprint: None
Story Points: None

Problem

A disaggregated-storage follower mongod hard-aborts (SIGABRT) in a pure read path when a layered cursor is walking the table under a read timestamp and the follower picks up a newer checkpoint in which the key the stable cursor is parked on has been pruned.

The assertion is in __clayered_reopen_stable (src/cursor/cur_layered.c:563):

int __clayered_reopen_stable(...), 563: WiredTiger assertion failed:
'ret == 0 || !((((&clayered->iface)->flags) & (0x002000000ull)) != 0) ||
 clayered->current_cursor == clayered->ingest_cursor'.
upgrading a positioned stable cursor

Crash stack (from a sys-perf tpcc_majority_out_of_cache run on disagg-m8g-perf-11-node.arm.aws, secondary node):

__wt_abort
__clayered_enter
__clayered_iterate
__clayered_next
mongo::FetchStage::doWork   (aggregate on tpcc.STOCK, readConcern majority)

The crashing operation is a read-only aggregate; there is no insert/update/remove on the crashing cursor.

Root cause

__clayered_can_advance_stable (src/cursor/cur_layered.c:496) returns true unconditionally whenever a read timestamp is set:

if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE)
    return (true);   /* advances even while parked on stable mid-walk */

The no-read-timestamp branch immediately below already refuses to advance while the layered cursor is positioned on the stable constituent. The read-timestamp branch skips that guard on the assumption that the history store makes a mid-iteration reopen always safe. That assumption is false: when the leader advances oldest past the reader's read timestamp and checkpoints, the newer checkpoint no longer retains the parked key. On the next next(), __clayered_reopen_stable calls __wt_cursor_dup_position() for the parked key and gets WT_NOTFOUND while the layered cursor is still WT_CURSTD_KEY_INT on stable (not ingest), which trips the assertion and aborts.

The existing WT_NOTFOUND recovery (clearing the iteration flags) is itself unsafe: it leaves the stable cursor unpositioned and can silently skip stable keys for the remainder of the walk.
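
To make the failing step concrete, here is a minimal standalone sketch in C. It is not WiredTiger code: the checkpoint is modeled as a plain array of keys, and toy_reopen_exact is a hypothetical stand-in for the exact-match reposition that the reopen performs on the parked key. A return of -1 plays the role of WT_NOTFOUND.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model (assumption): a checkpoint is just an array of keys. Repositioning
 * on the parked key is an exact-match lookup. Returns the key's index, or -1
 * when the key is absent, which stands in for WT_NOTFOUND.
 */
static int
toy_reopen_exact(const char *const *ckpt_keys, size_t nkeys, const char *parked_key)
{
    size_t i;

    assert(parked_key != NULL);
    for (i = 0; i < nkeys; i++)
        if (strcmp(ckpt_keys[i], parked_key) == 0)
            return ((int)i);
    return (-1);
}
```

With an old checkpoint holding a, m, z and a newer one holding only m, z, the lookup for the parked key a succeeds against the old checkpoint and fails against the newer one. That failure, combined with a cursor that is still positioned on stable, is the state the assertion rejects.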

Deterministic reproducer

A standalone Python test reproduces the abort with a pure-read walk and zero removes on the reading cursor (test file test/suite/test_layered_cursor_reopen_assert.py). Sequence:

1. Leader seeds keys a, m, z into stable at ts=10; stable_timestamp=10; checkpoint; follower picks it up.
2. Follower opens a layered cursor, begins a txn with read_timestamp=10, next() -> parks on stable key a.
3. Leader deletes a, then advances stable_timestamp=20 and oldest_timestamp=20 (past the reader's read ts) and checkpoints, so the new checkpoint no longer retains a.
4. Follower picks up the newer checkpoint.
5. The still-parked follower reader calls next() -> __clayered_reopen_stable -> dup_position(a) = WT_NOTFOUND -> abort.

Verified to abort both with and without the unrelated __clayered_remove cursor-overwrite change, i.e. it reproduces on clean develop (HEAD 5522f730f7). The reproducer is deterministic because the Python driver controls the interleave: the checkpoint is picked up while the reader is parked, before the next next().
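
The timestamp arithmetic behind the sequence can be sketched as a toy model in C. Everything here is an assumption made for illustration: the two rules are simplified, the names are hypothetical, and the delete timestamp of 15 used in the note below is invented, since the report only says the delete happens before stable and oldest move to 20.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TS_NONE 0 /* Hypothetical "no timestamp" sentinel for this sketch. */

/*
 * Simplified rule (assumption): a checkpoint keeps a deleted key only while
 * the delete is newer than the oldest timestamp. A key that was never deleted
 * is always kept.
 */
static bool
toy_checkpoint_retains(uint64_t delete_ts, uint64_t oldest_ts)
{
    if (delete_ts == TOY_TS_NONE)
        return (true);
    return (delete_ts > oldest_ts);
}

/*
 * Simplified rule (assumption): a reader's snapshot is only guaranteed while
 * its read timestamp has not fallen behind the oldest timestamp.
 */
static bool
toy_reader_still_covered(uint64_t read_ts, uint64_t oldest_ts)
{
    assert(read_ts != TOY_TS_NONE);
    return (read_ts >= oldest_ts);
}
```

Under these rules, with a deleted at a hypothetical ts=15: while oldest is 10 the key is retained and the reader at read_timestamp=10 is covered; once oldest moves to 20 the key is no longer retained and the reader is no longer covered, which matches the point in the sequence where the follower picks up the newer checkpoint.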

Proposed fix

Apply the same positioned-on-stable guard in the read-timestamp branch of __clayered_can_advance_stable:

if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE) {
    /*
     * Even under a read timestamp, don't advance while parked on the stable
     * constituent mid-iteration. The parked key can be pruned from the newer
     * checkpoint (leader advanced oldest past our read timestamp), and we cannot
     * reposition the walk without risking skipped keys. Staying on the current
     * checkpoint is the consistent view for this fixed read timestamp anyway.
     */
    if (F_ISSET(&clayered->iface, WT_CURSTD_KEY_INT) &&
      clayered->current_cursor == clayered->stable_cursor)
        return (false);
    return (true);
}

Not advancing while parked under a fixed read timestamp is correct: the reader should continue to see its consistent snapshot rather than jump to a checkpoint that pruned data visible at its read timestamp. The cost is holding the older checkpoint's dhandle for the duration of the walk - the resource tradeoff WT-17960 is evaluating. This must be reconciled with WT-17899 (which guards only the prepare-conflict variant) and the heavier "reposition the alternate via last-returned key" approach WT-17960 discusses.
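
The behavioral difference in the read-timestamp branch can be checked in isolation with a small standalone model in C. This is a sketch under stated assumptions, not the WiredTiger implementation: struct toy_cursor and both functions are hypothetical, and only the read-timestamp branch is modeled (the no-read-timestamp branch and any other conditions are left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the layered cursor's constituents. */
enum toy_constituent { TOY_NONE, TOY_INGEST, TOY_STABLE };

/* Hypothetical, reduced view of the layered cursor state. */
struct toy_cursor {
    bool positioned;              /* Models WT_CURSTD_KEY_INT being set. */
    enum toy_constituent current; /* Models current_cursor. */
};

/* Read-timestamp branch as described in the root cause: always advance. */
static bool
toy_advance_before_fix(const struct toy_cursor *c)
{
    assert(c != NULL);
    return (true);
}

/* Read-timestamp branch with the proposed guard: refuse while parked on stable. */
static bool
toy_advance_after_fix(const struct toy_cursor *c)
{
    assert(c != NULL);
    if (c->positioned && c->current == TOY_STABLE)
        return (false);
    return (true);
}
```

The two versions differ only for a cursor that is positioned on the stable constituent; an unpositioned cursor, or one positioned on ingest, may still advance in both.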

Impact

Hard abort of a disaggregated-storage follower node under a common read (majority/nearest reads) + concurrent-delete + checkpoint-pickup pattern. Observed reliably on the sys-perf tpcc_majority_out_of_cache DSC workload. This is the general read-timestamp/checkpoint case that WT-17960 (a spike) flags as unaddressed - WT-17899 fixes only the prepare-conflict path and does not cover this.

is related to

WT-17899 Prevent the reopening of stable during a prepare conflict stall

Closed

related to

WT-17899 Prevent the reopening of stable during a prepare conflict stall

Closed

WT-17923 Layered cursor: centralize iteration flag management

In Code Review

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Haribabu Kommi
Votes: 0
Watchers: 2

Created: Jul 02 2026 07:16:37 AM UTC
Updated: Jul 02 2026 12:36:13 PM UTC
Resolved: Jul 02 2026 12:36:07 PM UTC
