Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Storage Engines
19.597
Problem
A disaggregated-storage follower mongod hard-aborts (SIGABRT) in a pure read path when a layered cursor is walking the table under a read timestamp and the follower picks up a newer checkpoint in which the key the stable cursor is parked on is no longer present.
Assertion, in __clayered_reopen_stable (src/cursor/cur_layered.c:563 at the time of writing):
int __clayered_reopen_stable(...), 563: WiredTiger assertion failed:
'ret == 0 || !((((&clayered->iface)->flags) & (0x002000000ull)) != 0) ||
clayered->current_cursor == clayered->ingest_cursor'.
upgrading a positioned stable cursor
Originally observed on a sys-perf tpcc_majority_out_of_cache run (disagg-m8g-perf-11-node.arm.aws, secondary node) with this crash stack:
__wt_abort
__clayered_enter
__clayered_iterate
__clayered_next
mongo::FetchStage::doWork (aggregate on tpcc.STOCK, readConcern majority)
The crashing operation is a read-only aggregate/scan; there is no insert/update/remove on the crashing cursor.
Root cause
__clayered_can_advance_stable (src/cursor/cur_layered.c:496, pre-fix) returns true unconditionally whenever a read timestamp is set, skipping the guard that otherwise refuses to advance while positioned on the stable constituent:
if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE)
    return (true); /* bypasses the "positioned on stable" guard below */
That bypass is premised on "the view at a timestamp is always consistent, the history store covers that." Once the follower has picked up a newer checkpoint, the next call to next() makes __clayered_reopen_stable call __wt_cursor_dup_position() to reposition onto the new checkpoint. If the row for the parked key is gone, __wt_cursor_dup_position() returns WT_NOTFOUND while the layered cursor is still WT_CURSTD_KEY_INT on stable (not ingest), which trips the assertion.
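The failure sequence can be sketched with a toy model. This is not WiredTiger code: ToyLayeredCursor, park_on, reopen_stable and the string WT_NOTFOUND are hypothetical stand-ins, and the checkpoint is reduced to a plain dict. It only illustrates how "parked on stable" plus "key missing from the new checkpoint" makes the asserted condition false.

```python
# Toy model only: every name here is hypothetical, not a WiredTiger API.
WT_NOTFOUND = "WT_NOTFOUND"  # stand-in for the real error code


class ToyLayeredCursor:
    def __init__(self, stable_checkpoint):
        self.stable = stable_checkpoint  # dict: key -> value of the open checkpoint
        self.current = "stable"          # constituent the cursor is positioned on
        self.key = None                  # None models "WT_CURSTD_KEY_INT not set"

    def park_on(self, key):
        """Position the cursor on a key that exists in the current checkpoint."""
        assert key in self.stable
        self.key = key

    def reopen_stable(self, new_checkpoint):
        """Swap in a newer checkpoint and try to re-find the parked key."""
        self.stable = new_checkpoint
        ret = 0 if self.key in new_checkpoint else WT_NOTFOUND
        # Same shape as the failing assertion:
        #   ret == 0 || !positioned || current_cursor == ingest_cursor
        invariant = ret == 0 or self.key is None or self.current == "ingest"
        return ret, invariant


cursor = ToyLayeredCursor({"k1": "a", "k2": "b"})
cursor.park_on("k2")
# The newer checkpoint no longer contains k2.
ret, invariant_holds = cursor.reopen_stable({"k1": "a"})
print(ret, invariant_holds)
```

In this model the invariant only fails when all three conditions line up: the reposition fails, the cursor is positioned, and it is positioned on stable rather than ingest.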
This is NOT the same bug as WT-17968
WT-17968 covers a checkpoint-pickup pinned-timestamp panic check that was dead code (it checked an unpopulated struct field). It's tempting to assume that fixing it would also fix this assertion, but it does not: this assertion fires even when every adopted checkpoint fully respects the pinned-timestamp invariant. This was confirmed empirically: a repro that keeps the leader's and follower's oldest lag wide enough that the WT-17968 panic never fires (verified by grepping for the panic string across every failure; it never appears) still hits this __clayered_reopen_stable assertion reliably (8/8 runs, then reconfirmed 3/3 with a minimal 3-thread version). The two bugs operate at different granularities:
- WT-17968: a connection-wide check - does this checkpoint's oldest_timestamp respect the minimum pinned timestamp across all active readers.
- This bug: a per-cursor/per-key hazard - a checkpoint can be fully valid at the connection-wide level while a leader's independent reconciliation/obsolescence decisions still happen to drop the one specific row a specific follower-side cursor is parked on. The leader's reconciliation has no visibility into which exact keys any follower-side cursor is parked on, so a connection-wide "your pin is respected" guarantee does not imply "every key you've ever visited is still there."
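The difference in granularity can be shown with two independent predicates. This is a hedged sketch: both function names are hypothetical, and reading "respects the pin" as "oldest_timestamp <= minimum reader timestamp" is an assumption made for illustration, not the actual WT-17968 check.

```python
# Illustrative only: names and the exact comparison are assumptions.

def checkpoint_respects_pin(checkpoint_oldest_ts, reader_read_timestamps):
    """Connection-wide view: one comparison covering all active readers."""
    return checkpoint_oldest_ts <= min(reader_read_timestamps)


def parked_key_survives(checkpoint_keys, parked_key):
    """Per-cursor view: is the one key this cursor is parked on still present?"""
    return parked_key in checkpoint_keys


reader_timestamps = [50, 60]        # read timestamps of active follower readers
new_checkpoint_oldest_ts = 40       # does not pass the minimum pinned timestamp
new_checkpoint_keys = {"k1", "k3"}  # k2 was dropped by the leader
parked_key = "k2"

pin_ok = checkpoint_respects_pin(new_checkpoint_oldest_ts, reader_timestamps)
key_ok = parked_key_survives(new_checkpoint_keys, parked_key)
print(pin_ok, key_ok)
```

The first predicate never looks at individual keys, so it can pass while the second fails. That is the combination this ticket describes.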
Fix
Hoist the positioned-on-stable guard so it applies regardless of read timestamp, in __clayered_can_advance_stable:
if (F_ISSET(&clayered->iface, WT_CURSTD_KEY_INT) &&
    clayered->current_cursor == clayered->stable_cursor)
    return (false);
placed before the read-timestamp fast path, rather than only inside the no-read-timestamp else branch. Refusing to advance while parked on stable is also the more correct behavior for a fixed read timestamp: the reader should keep seeing its consistent view rather than jump to a checkpoint that dropped data it's positioned on. The cost is holding the older checkpoint's dhandle open a bit longer for an in-progress walk.
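The effect of reordering the two checks can be sketched as a pair of toy predicates. ToyCursorState and both function names are hypothetical, WT_TS_NONE is modeled as 0 for the sketch, and everything else the real __clayered_can_advance_stable may check is omitted.

```python
from dataclasses import dataclass

WT_TS_NONE = 0  # modeled as 0 here; an assumption of this sketch


@dataclass
class ToyCursorState:
    key_int: bool        # models F_ISSET(..., WT_CURSTD_KEY_INT)
    on_stable: bool      # models current_cursor == stable_cursor
    read_timestamp: int  # WT_TS_NONE when no read timestamp is set


def can_advance_pre_fix(state):
    # Read-timestamp fast path runs first and bypasses the guard.
    if state.read_timestamp != WT_TS_NONE:
        return True
    if state.key_int and state.on_stable:
        return False
    return True


def can_advance_post_fix(state):
    # Guard hoisted: applies regardless of read timestamp.
    if state.key_int and state.on_stable:
        return False
    return True


parked_reader = ToyCursorState(key_int=True, on_stable=True, read_timestamp=100)
unparked_reader = ToyCursorState(key_int=False, on_stable=True, read_timestamp=100)

print(can_advance_pre_fix(parked_reader), can_advance_post_fix(parked_reader))
print(can_advance_pre_fix(unparked_reader), can_advance_post_fix(unparked_reader))
```

Only the parked, read-timestamped case changes behavior between the two versions. An unpositioned cursor is still free to advance to the newer checkpoint.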
Deterministic-enough reproducer
test/suite/test_layered_cursor_reopen_stable_minimal.py - a minimal 3-thread repro (leader writer+checkpointer, follower checkpoint-picker, single follower read-timestamped scanner) using a single shared timestamp clock across leader and follower (representing the one global oplog timestamp sequence a real cluster has) with a wide oldest lag, so the WT-17968 panic is deliberately never triggered. Verified:
- Fix reverted: 3/3 runs SIGABRT with the assertion above.
- Fix applied: 3/3 runs clean.
A larger 7-thread variant matching TPCC's actual shape (including a follower "oplog apply" writer using overwrite=true removes) lives in test/suite/test_layered_cursor_tpcc_repro.py, used to separately confirm this bug is unrelated to the __clayered_remove cursor-overwrite change from BF-43334 (crashes identically with that change present or reverted).
Related tickets
Same general fragile area as WT-17899 (prepare-conflict-stall variant of stable reopen, in code review), WT-17960 (spike on repositioning stable across context changes), and WT-17923 (iteration-flag centralization refactor, in code review) - but this specific trigger (a plain read-timestamped scan, no prepare involved) and this specific assertion don't appear to be covered by any of them yet.
is related to
- WT-17899 Prevent the reopening of stable during a prepare conflict stall (Closed)
- WT-17968 Disaggregated storage checkpoint pick-up pinned-timestamp panic check reads unpopulated metadata (dead code) (Closed)
- WT-17923 Layered cursor: centralize iteration flag management (In Code Review)