Type: Bug
Resolution: Cannot Reproduce
Priority: Major - P3
Affects Version/s: None
Component/s: Cursors
Storage Engines
25.53
Problem
A disaggregated-storage follower mongod hard-aborts (SIGABRT) in a pure read path when a layered cursor is walking the table under a read timestamp and the follower picks up a newer checkpoint in which the key the stable cursor is parked on has been pruned.
The assertion is in __clayered_reopen_stable (src/cursor/cur_layered.c:563):
int __clayered_reopen_stable(...), 563: WiredTiger assertion failed:
'ret == 0 || !((((&clayered->iface)->flags) & (0x002000000ull)) != 0) ||
clayered->current_cursor == clayered->ingest_cursor'.
upgrading a positioned stable cursor
Crash stack (from a sys-perf tpcc_majority_out_of_cache run on disagg-m8g-perf-11-node.arm.aws, secondary node):
__wt_abort
__clayered_enter
__clayered_iterate
__clayered_next
mongo::FetchStage::doWork (aggregate on tpcc.STOCK, readConcern majority)
The crashing operation is a read-only aggregate; there is no insert/update/remove on the crashing cursor.
Root cause
__clayered_can_advance_stable (src/cursor/cur_layered.c:496) returns true unconditionally whenever a read timestamp is set:
    if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE)
        return (true); /* advances even while parked on stable mid-walk */
The no-read-timestamp branch immediately below already refuses to advance while the layered cursor is positioned on the stable constituent, but the read-timestamp branch skips that guard on the assumption that the history store makes a mid-iteration reopen always safe. That assumption is false: when the leader advances the oldest timestamp past the reader's read timestamp and checkpoints, the newer checkpoint no longer retains the parked key. On the next next(), __clayered_reopen_stable calls __wt_cursor_dup_position() for the parked key and gets WT_NOTFOUND. At that point the layered cursor still has WT_CURSTD_KEY_INT set and its current cursor is stable (not ingest), so the assertion trips and the process aborts.
The existing WT_NOTFOUND recovery (clearing the iteration flags) is itself unsafe: it leaves the stable cursor unpositioned and can silently skip stable keys for the remainder of the walk.
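The assertion quoted in the Problem section can be read as a three-way predicate over the reopen result, the key-positioned flag, and which constituent the layered cursor is on. The following is a toy Python model of that predicate, not WiredTiger code; the function name and the string labels for the constituents are hypothetical, and it assumes the flag 0x002000000 in the assertion text is WT_CURSTD_KEY_INT, as the description above implies.

```python
# Toy model of the assertion condition in __clayered_reopen_stable.
# Not WiredTiger code: names and the "stable"/"ingest" labels are illustrative.
WT_NOTFOUND = -31803  # WiredTiger's "item not found" return code


def reopen_assertion_holds(ret, key_int_set, current_cursor):
    """Mirror of: ret == 0 || !KEY_INT || current_cursor == ingest_cursor."""
    return ret == 0 or not key_int_set or current_cursor == "ingest"


# State described in the root cause: dup_position returned WT_NOTFOUND while
# the layered cursor is key-positioned on the stable constituent.
crash_state_ok = reopen_assertion_holds(WT_NOTFOUND, True, "stable")
```

In this model, crash_state_ok is the only combination of the three inputs that evaluates to False; a successful reopen, an unpositioned cursor, or a cursor positioned on ingest all satisfy the condition.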
Deterministic reproducer
A standalone Python test reproduces the abort with a pure-read walk and zero removes on the reading cursor (test file test/suite/test_layered_cursor_reopen_assert.py). Sequence:
- Leader seeds keys a, m, z into stable at ts=10; stable_timestamp=10; checkpoint; follower picks it up.
- Follower opens a layered cursor, begins a txn with read_timestamp=10, next() -> parks on stable key a.
- Leader deletes a, then advances stable_timestamp=20 and oldest_timestamp=20 (past the reader's read ts) and checkpoints, so the new checkpoint no longer retains a.
- Follower picks up the newer checkpoint.
- The still-parked follower reader calls next() -> __clayered_reopen_stable -> dup_position(a) = WT_NOTFOUND -> abort.
Verified to abort both with and without the unrelated __clayered_remove cursor-overwrite change, i.e. it reproduces on clean develop (HEAD 5522f730f7). The reproducer is deterministic because the Python driver controls the interleave: the checkpoint is picked up while the reader is parked, before the next next().
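The interleave above can be sketched without WiredTiger at all. The following is a minimal Python model under loud assumptions: checkpoints are plain dicts of key to commit timestamp, ToyNotFound and reopen_stable are hypothetical stand-ins for WT_NOTFOUND and the position duplicate inside __clayered_reopen_stable, and pruning is modeled as simply dropping the deleted key from the newer checkpoint. It is not the test in test/suite.

```python
class ToyNotFound(Exception):
    """Hypothetical stand-in for WT_NOTFOUND from the position duplicate."""


def reopen_stable(new_checkpoint, parked_key):
    """Try to reposition on parked_key in the newer checkpoint."""
    if parked_key not in new_checkpoint:
        raise ToyNotFound(parked_key)
    return parked_key


# Step 1: leader seeds a, m, z at ts=10 and checkpoints; follower picks it up.
ckpt_ts10 = {"a": 10, "m": 10, "z": 10}

# Step 2: follower reader (read_timestamp=10) calls next() and parks on "a".
parked = sorted(ckpt_ts10)[0]

# Step 3: leader deletes "a", moves stable and oldest to 20, and checkpoints.
# With oldest past the reader's timestamp, the new checkpoint drops "a".
ckpt_ts20 = {k: ts for k, ts in ckpt_ts10.items() if k != "a"}

# Steps 4-5: follower picks up ckpt_ts20; the parked reader's next() reopens.
try:
    reopen_stable(ckpt_ts20, parked)
    outcome = "repositioned"
except ToyNotFound:
    outcome = "not found"  # the state in which the real assertion fires
```

The point of the model is the ordering: the newer checkpoint becomes visible between two next() calls on a reader that never modifies anything, which is why the Python driver can make the failure deterministic.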
Proposed fix
Apply the same positioned-on-stable guard in the read-timestamp branch of __clayered_can_advance_stable:
    if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE) {
        /*
         * Even under a read timestamp, don't advance while parked on the stable
         * constituent mid-iteration. The parked key can be pruned from the newer
         * checkpoint (leader advanced oldest past our read timestamp), and we cannot
         * reposition the walk without risking skipped keys. Staying on the current
         * checkpoint is the consistent view for this fixed read timestamp anyway.
         */
        if (F_ISSET(&clayered->iface, WT_CURSTD_KEY_INT) &&
          clayered->current_cursor == clayered->stable_cursor)
            return (false);
        return (true);
    }
Not advancing while parked under a fixed read timestamp is correct: the reader should continue to see its consistent snapshot rather than jump to a checkpoint that pruned data visible at its read timestamp. The cost is holding the older checkpoint's dhandle for the duration of the walk - the resource tradeoff WT-17960 is evaluating. This must be reconciled with WT-17899 (which guards only the prepare-conflict variant) and the heavier "reposition the alternate via last-returned key" approach WT-17960 discusses.
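To illustrate the effect of the guard, here is a small Python sketch of the read-timestamp branch only. It is a toy model under stated assumptions, not the WiredTiger implementation: can_advance_under_read_ts and walk are hypothetical names, checkpoints are sets of keys, the newer checkpoint is assumed to arrive right after the first next(), and the model never advances later in the walk, which is a simplification.

```python
def can_advance_under_read_ts(key_int_set, current_cursor, guarded):
    """Read-timestamp branch only; guarded=False models the current behavior."""
    parked_on_stable = key_int_set and current_cursor == "stable"
    if guarded and parked_on_stable:
        return False  # proposed guard: stay on the current checkpoint
    return True


def walk(old_ckpt, new_ckpt, guarded):
    """Keys a reader sees when new_ckpt arrives after its first next()."""
    active = sorted(old_ckpt)
    seen = [active[0]]  # first next(): parked on the smallest stable key
    if can_advance_under_read_ts(True, "stable", guarded):
        if seen[-1] not in new_ckpt:
            # Models the WT_NOTFOUND that leads to the abort.
            raise LookupError("parked key pruned from newer checkpoint")
        active = sorted(new_ckpt)
    seen += [k for k in active if k > seen[-1]]
    return seen


old = {"a", "m", "z"}
new = {"m", "z"}  # "a" pruned after oldest moved past the read timestamp
guarded_result = walk(old, new, guarded=True)
```

With guarded=True the walk stays on the older checkpoint and returns every key visible at the read timestamp; with guarded=False the same inputs raise, mirroring the failure in the reproducer. The model says nothing about the dhandle-retention cost, which is the tradeoff left to WT-17960.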
Impact
Hard abort of a disaggregated-storage follower node under a common read (majority/nearest reads) + concurrent-delete + checkpoint-pickup pattern. Observed reliably on the sys-perf tpcc_majority_out_of_cache DSC workload. This is the general read-timestamp/checkpoint case that WT-17960 (a spike) flags as unaddressed - WT-17899 fixes only the prepare-conflict path and does not cover this.