Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0
Affects Version/s: None
Component/s: Cursors
Labels: None

Assigned Teams: Storage Engines - Foundations
Total Hours with Assigned Team: 43.83
Sprint: None
Story Points: None

Problem

A disaggregated-storage follower mongod hard-aborts (SIGABRT) on a pure read path. The trigger: a layered cursor is walking the table under a read timestamp, and the follower picks up a newer checkpoint in which the key the stable cursor is parked on is no longer present.

Assertion, in __clayered_reopen_stable (src/cursor/cur_layered.c:563 at the time of writing):

int __clayered_reopen_stable(...), 563: WiredTiger assertion failed:
'ret == 0 || !((((&clayered->iface)->flags) & (0x002000000ull)) != 0) ||
 clayered->current_cursor == clayered->ingest_cursor'.
upgrading a positioned stable cursor

Originally observed on a sys-perf tpcc_majority_out_of_cache run (disagg-m8g-perf-11-node.arm.aws, secondary node) with this crash stack:

__wt_abort
__clayered_enter
__clayered_iterate
__clayered_next
mongo::FetchStage::doWork   (aggregate on tpcc.STOCK, readConcern majority)

The crashing operation is a read-only aggregate/scan; there is no insert/update/remove on the crashing cursor.

Root cause

__clayered_can_advance_stable (src/cursor/cur_layered.c:496, pre-fix) returns true unconditionally whenever a read timestamp is set, skipping the guard that otherwise refuses to advance while positioned on the stable constituent:

if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE)
    return (true);   /* bypasses the "positioned on stable" guard below */

That bypass is premised on "the view at a timestamp is always consistent, the history store covers that." On the following next() call, __clayered_reopen_stable calls __wt_cursor_dup_position() to reposition onto the new checkpoint. If the row for that specific key is gone, the reposition returns WT_NOTFOUND while the layered cursor still has WT_CURSTD_KEY_INT set and is positioned on stable (not ingest), which trips the assertion.
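
The failure condition can be sketched as a toy model. This is only an illustration of the assertion predicate quoted above, not WiredTiger code: every toy_ name, the TOY_NOTFOUND value, and the array-of-keys checkpoint are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for WT_NOTFOUND; the value is arbitrary. */
#define TOY_NOTFOUND (-1)

enum toy_constituent { TOY_INGEST, TOY_STABLE };

struct toy_layered {
    bool key_int;                 /* models WT_CURSTD_KEY_INT being set */
    enum toy_constituent current; /* constituent the cursor is positioned on */
};

/*
 * Same shape as the assertion in the ticket:
 * ret == 0 || !KEY_INT || current_cursor == ingest_cursor.
 */
static bool
toy_reopen_assertion_holds(int ret, const struct toy_layered *c)
{
    return (ret == 0 || !c->key_int || c->current == TOY_INGEST);
}

/* Models repositioning on the new checkpoint: fails if the parked key was dropped. */
static int
toy_dup_position(const uint64_t *new_ckpt_keys, size_t n, uint64_t parked_key)
{
    for (size_t i = 0; i < n; i++)
        if (new_ckpt_keys[i] == parked_key)
            return (0);
    return (TOY_NOTFOUND);
}
```

In this model, a cursor parked on stable at key 20 fails the predicate as soon as the new checkpoint holds only keys 10 and 30; the same lookup is harmless if the cursor is positioned on ingest.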

This is NOT the same bug as WT-17968

WT-17968 covers a checkpoint-pickup pinned-timestamp panic that was dead code (checked an unpopulated struct field). It's tempting to assume fixing that would also fix this assertion, but it does not: this assertion fires even when every adopted checkpoint fully respects the pinned-timestamp invariant. Confirmed empirically - a repro that keeps the leader's/follower's oldest lag wide enough that the WT-17968 panic never fires (verified: grepped for the panic string across every failure, it never appears) still hits this __clayered_reopen_stable assertion reliably (8/8, then reconfirmed 3/3 with a minimal 3-thread version). The two bugs operate at different granularities:

WT-17968: a connection-wide check - does this checkpoint's oldest_timestamp respect the minimum pinned timestamp across all active readers.
This bug: a per-cursor/per-key hazard - a checkpoint can be fully valid at the connection-wide level while a leader's independent reconciliation/obsolescence decisions still happen to drop the one specific row a specific follower-side cursor is parked on. The leader's reconciliation has no visibility into which exact keys any follower-side cursor is parked on, so a connection-wide "your pin is respected" guarantee does not imply "every key you've ever visited is still there."
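
The difference in granularity can be shown with a small sketch. It assumes, for illustration only, that "respecting the pin" means the checkpoint's oldest timestamp is not ahead of the minimum pinned timestamp; the toy_checkpoint struct and both helpers are hypothetical and do not correspond to WiredTiger structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checkpoint summary: one timestamp plus the keys it still contains. */
struct toy_checkpoint {
    uint64_t oldest_timestamp;
    const uint64_t *keys;
    size_t nkeys;
};

/* Connection-wide check: a single comparison against the minimum pinned timestamp. */
static bool
toy_pin_respected(const struct toy_checkpoint *ckpt, uint64_t min_pinned_ts)
{
    /* Assumed direction of the invariant; see the lead-in. */
    return (ckpt->oldest_timestamp <= min_pinned_ts);
}

/* Per-cursor check: is the exact key this cursor is parked on still present? */
static bool
toy_key_present(const struct toy_checkpoint *ckpt, uint64_t parked_key)
{
    for (size_t i = 0; i < ckpt->nkeys; i++)
        if (ckpt->keys[i] == parked_key)
            return (true);
    return (false);
}
```

A checkpoint with oldest_timestamp 50 and keys {1, 2, 4} passes toy_pin_respected for a reader pinned at 80, yet toy_key_present is false for a cursor parked on key 3. The two predicates are independent, which is why satisfying the first says nothing about the second.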

Fix

Hoist the positioned-on-stable guard so it applies regardless of read timestamp, in __clayered_can_advance_stable:

if (F_ISSET(&clayered->iface, WT_CURSTD_KEY_INT) &&
  clayered->current_cursor == clayered->stable_cursor)
    return (false);

placed before the read-timestamp fast path, rather than only inside the no-read-timestamp else branch. Refusing to advance while parked on stable is also the more correct behavior for a fixed read timestamp: the reader should keep seeing its consistent view rather than jump to a checkpoint that dropped data it's positioned on. The cost is holding the older checkpoint's dhandle open a bit longer for an in-progress walk.
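
The effect of the reordering can be seen in a simplified model. The sketch below assumes the function reduces to just these two checks, which is a simplification; the toy_state struct, TOY_TS_NONE, and both function names are hypothetical and stand in for the real cursor and transaction state.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_TS_NONE 0 /* hypothetical stand-in for WT_TS_NONE */

struct toy_state {
    uint64_t read_timestamp; /* TOY_TS_NONE when no read timestamp is set */
    bool key_int;            /* models WT_CURSTD_KEY_INT */
    bool on_stable;          /* models current_cursor == stable_cursor */
};

/* Pre-fix ordering: the read-timestamp fast path runs before the guard. */
static bool
toy_can_advance_prefix(const struct toy_state *s)
{
    if (s->read_timestamp != TOY_TS_NONE)
        return (true); /* guard below is never reached */
    if (s->key_int && s->on_stable)
        return (false);
    return (true);
}

/* Post-fix ordering: the positioned-on-stable guard applies unconditionally. */
static bool
toy_can_advance_postfix(const struct toy_state *s)
{
    if (s->key_int && s->on_stable)
        return (false);
    /* Any remaining checks are elided in this model. */
    return (true);
}
```

For a read-timestamped cursor parked on stable, the pre-fix model allows the advance and the post-fix model refuses it; in every other combination the two orderings agree.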

Deterministic-enough reproducer

test/suite/test_layered_cursor_reopen_stable_minimal.py - a minimal 3-thread repro (leader writer+checkpointer, follower checkpoint-picker, single follower read-timestamped scanner) using a single shared timestamp clock across leader and follower (representing the one global oplog timestamp sequence a real cluster has) with a wide oldest lag, so the WT-17968 panic is deliberately never triggered. Verified:

Fix reverted: 3/3 runs SIGABRT with the assertion above.
Fix applied: 3/3 runs clean.

A larger 7-thread variant matching TPCC's actual shape (including a follower "oplog apply" writer using overwrite=true removes) lives in test/suite/test_layered_cursor_tpcc_repro.py, used to separately confirm this bug is unrelated to the __clayered_remove cursor-overwrite change from BF-43334 (crashes identically with that change present or reverted).

Related tickets

Same general fragile area as WT-17899 (prepare-conflict-stall variant of stable reopen, in code review), WT-17960 (spike on repositioning stable across context changes), and WT-17923 (iteration-flag centralization refactor, in code review) - but this specific trigger (a plain read-timestamped scan, no prepare involved) and this specific assertion don't appear to be covered by any of them yet.

is related to

WT-17968 Disaggregated storage checkpoint pick-up pinned-timestamp panic check reads unpopulated metadata (dead code) - Blocked
WT-17899 Prevent the reopening of stable during a prepare conflict stall - Closed
WT-17923 Layered cursor: centralize iteration flag management - In Code Review
WT-17994 Rework fragile iteration state logic in __clayered_iterate_constituents() - Open
