Follower layered cursor aborts in __clayered_reopen_stable when a read-timestamped reader's parked stable key vanishes from a legitimately-adopted newer checkpoint


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines
    • 19.597

      Problem

      A disaggregated-storage follower mongod hard-aborts (SIGABRT) in a pure read path when a layered cursor is walking the table under a read timestamp and the follower picks up a newer checkpoint in which the key the stable cursor is parked on is no longer present.

      Assertion, in __clayered_reopen_stable (src/cursor/cur_layered.c:563 at the time of writing):

      int __clayered_reopen_stable(...), 563: WiredTiger assertion failed:
      'ret == 0 || !((((&clayered->iface)->flags) & (0x002000000ull)) != 0) ||
       clayered->current_cursor == clayered->ingest_cursor'.
      upgrading a positioned stable cursor
      

      Originally observed on a sys-perf tpcc_majority_out_of_cache run (disagg-m8g-perf-11-node.arm.aws, secondary node) with this crash stack:

      __wt_abort
      __clayered_enter
      __clayered_iterate
      __clayered_next
      mongo::FetchStage::doWork   (aggregate on tpcc.STOCK, readConcern majority)
      

      The crashing operation is a read-only aggregate/scan; there is no insert/update/remove on the crashing cursor.

      Root cause

      __clayered_can_advance_stable (src/cursor/cur_layered.c:496, pre-fix) returns true unconditionally whenever a read timestamp is set, skipping the guard that otherwise refuses to advance while positioned on the stable constituent:

      if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE)
          return (true);   /* bypasses the "positioned on stable" guard below */
      

      That bypass is premised on "the view at a timestamp is always consistent, the history store covers that." On the following next() call, __clayered_reopen_stable calls __wt_cursor_dup_position() to reposition onto the new checkpoint. If the row for that specific key is gone, __wt_cursor_dup_position() returns WT_NOTFOUND while the layered cursor still has WT_CURSTD_KEY_INT set and its current cursor is stable (not ingest), which trips the assertion.
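
The failure chain described above can be sketched as a toy model. This is not WiredTiger code: ToyLayeredCursor, prefix_can_advance, toy_dup_position, assertion_holds and next_would_abort are hypothetical names, the checkpoint is reduced to a set of keys, and the real function may check conditions that are not modeled here. It only mirrors the pre-fix ordering and the asserted predicate as this report describes them.

```python
from dataclasses import dataclass

# Matches the WT_NOTFOUND value in wiredtiger.h; only "non-zero" matters here.
WT_NOTFOUND = -31803
WT_TS_NONE = 0


@dataclass
class ToyLayeredCursor:
    """Hypothetical stand-in for the state the assertion inspects."""
    key_int: bool        # models WT_CURSTD_KEY_INT being set
    current: str         # "stable" or "ingest"
    read_timestamp: int  # WT_TS_NONE when no read timestamp is set
    parked_key: str


def prefix_can_advance(cur):
    """Pre-fix ordering: the read-timestamp check runs before the guard."""
    if cur.read_timestamp != WT_TS_NONE:
        return True
    if cur.key_int and cur.current == "stable":
        return False
    return True  # other real-world conditions are not modeled


def toy_dup_position(checkpoint_keys, key):
    """Return 0 if the key exists in the adopted checkpoint, else WT_NOTFOUND."""
    return 0 if key in checkpoint_keys else WT_NOTFOUND


def assertion_holds(ret, cur):
    """The asserted predicate from the report, with the flag test named."""
    return ret == 0 or not cur.key_int or cur.current == "ingest"


def next_would_abort(cur, checkpoint_keys):
    """True when the toy next() would trip the assertion."""
    if not prefix_can_advance(cur):
        return False  # stays on the old checkpoint, so no reopen is attempted
    ret = toy_dup_position(checkpoint_keys, cur.parked_key)
    return not assertion_holds(ret, cur)
```

In this model, a cursor parked on "k2" on the stable constituent with read timestamp 50 aborts when the adopted checkpoint holds only {"k1", "k3"}. The same cursor with no read timestamp never attempts the reopen, and a cursor whose current constituent is ingest satisfies the assertion even though the reposition fails.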

      This is NOT the same bug as WT-17968

      WT-17968 covers a checkpoint-pickup pinned-timestamp panic that was dead code (it checked an unpopulated struct field). It is tempting to assume that fixing WT-17968 would also fix this assertion, but it does not: this assertion fires even when every adopted checkpoint fully respects the pinned-timestamp invariant. This was confirmed empirically. A repro that keeps the leader's/follower's oldest lag wide enough that the WT-17968 panic never fires (verified by grepping for the panic string across every failure; it never appears) still hits this __clayered_reopen_stable assertion reliably (8/8 runs, then reconfirmed 3/3 with a minimal 3-thread version). The two bugs operate at different granularities:

      • WT-17968: a connection-wide check - does this checkpoint's oldest_timestamp respect the minimum pinned timestamp across all active readers.
      • This bug: a per-cursor/per-key hazard - a checkpoint can be fully valid at the connection-wide level while a leader's independent reconciliation/obsolescence decisions still happen to drop the one specific row a specific follower-side cursor is parked on. The leader's reconciliation has no visibility into which exact keys any follower-side cursor is parked on, so a connection-wide "your pin is respected" guarantee does not imply "every key you've ever visited is still there."
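
The difference in granularity can be illustrated with two independent predicates. This is a rough sketch under stated assumptions, not the WT-17968 check itself: checkpoint_respects_pin and parked_key_survives are hypothetical helpers, the checkpoint is reduced to an oldest timestamp plus a set of keys, and the comparison direction (the checkpoint's oldest timestamp must not exceed the minimum pinned timestamp) is an assumption about what "respects" means.

```python
def checkpoint_respects_pin(checkpoint_oldest_ts, reader_pins):
    """Connection-wide view: one comparison covering all active readers.

    Assumed semantics: the checkpoint's oldest timestamp must not be newer
    than the smallest timestamp pinned by any reader.
    """
    if not reader_pins:
        return True
    return checkpoint_oldest_ts <= min(reader_pins)


def parked_key_survives(checkpoint_keys, parked_key):
    """Per-cursor view: is the one key this cursor is parked on still there?"""
    return parked_key in checkpoint_keys


def classify(checkpoint_oldest_ts, checkpoint_keys, reader_pins, parked_key):
    """Return both answers so the two granularities can be compared."""
    return (
        checkpoint_respects_pin(checkpoint_oldest_ts, reader_pins),
        parked_key_survives(checkpoint_keys, parked_key),
    )
```

With an oldest timestamp of 40, readers pinned at 50 and 70, checkpoint keys {"k1", "k3"} and a cursor parked on "k2", classify returns (True, False): the connection-wide check passes while the per-key hazard is still present. Neither answer can be derived from the other.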

      Fix

      Hoist the positioned-on-stable guard so it applies regardless of read timestamp, in __clayered_can_advance_stable:

      if (F_ISSET(&clayered->iface, WT_CURSTD_KEY_INT) &&
        clayered->current_cursor == clayered->stable_cursor)
          return (false);
      

      placed before the read-timestamp fast path, rather than only inside the no-read-timestamp else branch. Refusing to advance while parked on stable is also the more correct behavior for a fixed read timestamp: the reader should keep seeing its consistent view rather than jump to a checkpoint that dropped data it's positioned on. The cost is holding the older checkpoint's dhandle open a bit longer for an in-progress walk.
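
To show which cases the reordering affects, the two guard orderings can be compared exhaustively in a small model. This is a sketch, not the real function: prefix_can_advance, fixed_can_advance and behaviour_changes are hypothetical names, and other_checks_ok is a hypothetical parameter standing in for whatever else the no-read-timestamp branch decides.

```python
from itertools import product


def prefix_can_advance(key_int, on_stable, has_read_ts, other_checks_ok):
    """Pre-fix ordering: the read-timestamp fast path runs first."""
    if has_read_ts:
        return True
    if key_int and on_stable:
        return False
    return other_checks_ok


def fixed_can_advance(key_int, on_stable, has_read_ts, other_checks_ok):
    """Fixed ordering: the positioned-on-stable guard is hoisted to the top."""
    if key_int and on_stable:
        return False
    if has_read_ts:
        return True
    return other_checks_ok


def behaviour_changes():
    """List every input combination where the two orderings disagree."""
    return [
        combo
        for combo in product([False, True], repeat=4)
        if prefix_can_advance(*combo) != fixed_can_advance(*combo)
    ]
```

In this model the two orderings disagree only when the cursor is positioned on stable and a read timestamp is set, which is the situation this ticket describes. Every other combination, including all cases without a read timestamp, behaves the same before and after the change.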

      Deterministic-enough reproducer

      test/suite/test_layered_cursor_reopen_stable_minimal.py - a minimal 3-thread repro (leader writer+checkpointer, follower checkpoint-picker, single follower read-timestamped scanner) using a single shared timestamp clock across leader and follower (representing the one global oplog timestamp sequence a real cluster has) with a wide oldest lag, so the WT-17968 panic is deliberately never triggered. Verified:

      • Fix reverted: 3/3 runs SIGABRT with the assertion above.
      • Fix applied: 3/3 runs clean.

      A larger 7-thread variant matching TPCC's actual shape (including a follower "oplog apply" writer using overwrite=true removes) lives in test/suite/test_layered_cursor_tpcc_repro.py, used to separately confirm this bug is unrelated to the __clayered_remove cursor-overwrite change from BF-43334 (crashes identically with that change present or reverted).

      Related tickets

      Same general fragile area as WT-17899 (prepare-conflict-stall variant of stable reopen, in code review), WT-17960 (spike on repositioning stable across context changes), and WT-17923 (iteration-flag centralization refactor, in code review) - but this specific trigger (a plain read-timestamped scan, no prepare involved) and this specific assertion don't appear to be covered by any of them yet.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Haribabu Kommi
            Votes:
            0
            Watchers:
            2