Follower layered cursor aborts in __clayered_reopen_stable when a read-timestamped walk's parked stable key is pruned by a picked-up checkpoint


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cursors
    • Storage Engines
    • 25.53

      Problem

      A disaggregated-storage follower mongod hard-aborts (SIGABRT) in a pure read path when a layered cursor is walking the table under a read timestamp and the follower picks up a newer checkpoint in which the key the stable cursor is parked on has been pruned.

      The assertion is in __clayered_reopen_stable (src/cursor/cur_layered.c:563):

      int __clayered_reopen_stable(...), 563: WiredTiger assertion failed:
      'ret == 0 || !((((&clayered->iface)->flags) & (0x002000000ull)) != 0) ||
       clayered->current_cursor == clayered->ingest_cursor'.
      upgrading a positioned stable cursor
      

      Crash stack (from a sys-perf tpcc_majority_out_of_cache run on disagg-m8g-perf-11-node.arm.aws, secondary node):

      __wt_abort
      __clayered_enter
      __clayered_iterate
      __clayered_next
      mongo::FetchStage::doWork   (aggregate on tpcc.STOCK, readConcern majority)
      

      The crashing operation is a read-only aggregate; there is no insert/update/remove on the crashing cursor.

      Root cause

      __clayered_can_advance_stable (src/cursor/cur_layered.c:496) returns true unconditionally whenever a read timestamp is set:

      if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE)
          return (true);   /* advances even while parked on stable mid-walk */
      

      The no-read-timestamp branch immediately below already refuses to advance while the layered cursor is positioned on the stable constituent, but the read-timestamp branch skips that guard on the assumption that the history store makes a mid-iteration reopen always safe. That assumption is false: when the leader advances oldest_timestamp past the reader's read timestamp and checkpoints, the newer checkpoint no longer retains the parked key. On the next next(), __clayered_reopen_stable calls __wt_cursor_dup_position() for the parked key and gets WT_NOTFOUND. The layered cursor still has WT_CURSTD_KEY_INT set and its current cursor is stable (not ingest), so the assertion trips and the process aborts.

      The existing WT_NOTFOUND recovery (clearing the iteration flags) is itself unsafe: it leaves the stable cursor unpositioned and can silently skip stable keys for the remainder of the walk.
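
      Why the newer checkpoint drops the parked key can be sketched with a toy model. This is not WiredTiger code: retained_in_checkpoint is a hypothetical helper, the retention rule it encodes is a simplifying assumption, and the delete timestamp of 15 is assumed for illustration (the ticket only says the delete happens before stable and oldest move to 20).

```python
def retained_in_checkpoint(insert_ts, delete_ts, stable_ts, oldest_ts):
    """Toy rule (an assumption, not WiredTiger's implementation): a checkpoint
    keeps a key only if a reader at some timestamp in [oldest_ts, stable_ts]
    could still see it."""
    if insert_ts > stable_ts:
        return False  # the insert is not yet stable, so it is not checkpointed
    if delete_ts is None or delete_ts > stable_ts:
        return True  # never deleted, or the delete is not yet stable
    # The delete is stable: history survives only for readers at or after oldest.
    return delete_ts > oldest_ts


# Key "a": inserted at ts=10, deleted at an assumed ts=15.
first_checkpoint = retained_in_checkpoint(10, None, stable_ts=10, oldest_ts=0)
oldest_behind_reader = retained_in_checkpoint(10, 15, stable_ts=20, oldest_ts=10)
oldest_past_reader = retained_in_checkpoint(10, 15, stable_ts=20, oldest_ts=20)

print(first_checkpoint, oldest_behind_reader, oldest_past_reader)
# prints: True True False
```

      Under this model the key survives as long as oldest stays at or below the reader's timestamp, and disappears once oldest moves past the delete, which is the case the read-timestamp branch does not account for.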

      Deterministic reproducer

      A standalone Python test reproduces the abort with a pure-read walk and zero removes on the reading cursor (test file test/suite/test_layered_cursor_reopen_assert.py). Sequence:

      1. Leader seeds keys a, m, z into stable at ts=10; stable_timestamp=10; checkpoint; follower picks it up.
      2. Follower opens a layered cursor, begins a txn with read_timestamp=10, next() -> parks on stable key a.
      3. Leader deletes a, then advances stable_timestamp=20 and oldest_timestamp=20 (past the reader's read ts) and checkpoints, so the new checkpoint no longer retains a.
      4. Follower picks up the newer checkpoint.
      5. The still-parked follower reader calls next() -> __clayered_reopen_stable -> dup_position(a) = WT_NOTFOUND -> abort.

      Verified to abort both with and without the unrelated __clayered_remove cursor-overwrite change, so it reproduces on clean develop (HEAD 5522f730f7). The reproducer is deterministic because the Python driver controls the interleaving: the checkpoint is picked up while the reader is parked, before its next call to next().
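
      The five steps above can be mimicked by a toy model that makes no WiredTiger calls. ToyStableCursor, pick_up, and ReopenAbort are hypothetical names; checkpoints are reduced to sorted key lists, and the reopen is modelled as an exact-key lookup standing in for the duplicate-position call. It is a sketch of the interleaving, not the actual test.

```python
from bisect import bisect_right


class ReopenAbort(Exception):
    """Stands in for the assertion failure; the real process aborts."""


class ToyStableCursor:
    """Toy stable constituent: a forward walk over one checkpoint's keys."""

    def __init__(self, checkpoint_keys):
        self.open_keys = sorted(checkpoint_keys)  # checkpoint the cursor reads
        self.latest_keys = self.open_keys         # newest checkpoint picked up
        self.parked = None                        # key the walk is parked on

    def pick_up(self, checkpoint_keys):
        # The follower learns of a newer checkpoint; the cursor reopens lazily.
        self.latest_keys = sorted(checkpoint_keys)

    def next(self):
        if self.latest_keys is not self.open_keys:
            # Unguarded behaviour under a read timestamp: always reopen, then
            # try to restore the position by exact key.
            if self.parked is not None and self.parked not in self.latest_keys:
                raise ReopenAbort("parked key %r is not in the newer checkpoint" % self.parked)
            self.open_keys = self.latest_keys
        start = 0 if self.parked is None else bisect_right(self.open_keys, self.parked)
        if start == len(self.open_keys):
            return None  # end of walk
        self.parked = self.open_keys[start]
        return self.parked


cursor = ToyStableCursor(["a", "m", "z"])  # step 1: first checkpoint picked up
first = cursor.next()                      # step 2: parks on "a"
cursor.pick_up(["m", "z"])                 # steps 3-4: "a" pruned, new checkpoint picked up
try:
    cursor.next()                          # step 5: reopen cannot find "a"
    outcome = "walked on"
except ReopenAbort:
    outcome = "abort"
print(first, outcome)
# prints: a abort
```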

      Proposed fix

      Apply the same positioned-on-stable guard in the read-timestamp branch of __clayered_can_advance_stable:

      if (txn_shared != NULL && txn_shared->read_timestamp != WT_TS_NONE) {
          /*
           * Even under a read timestamp, don't advance while parked on the stable
           * constituent mid-iteration. The parked key can be pruned from the newer
           * checkpoint (leader advanced oldest past our read timestamp), and we cannot
           * reposition the walk without risking skipped keys. Staying on the current
           * checkpoint is the consistent view for this fixed read timestamp anyway.
           */
          if (F_ISSET(&clayered->iface, WT_CURSTD_KEY_INT) &&
            clayered->current_cursor == clayered->stable_cursor)
              return (false);
          return (true);
      }
      

      Not advancing while parked under a fixed read timestamp is correct: the reader should continue to see its consistent snapshot rather than jump to a checkpoint that pruned data visible at its read timestamp. The cost is holding the older checkpoint's dhandle for the duration of the walk - the resource tradeoff WT-17960 is evaluating. This must be reconciled with WT-17899 (which guards only the prepare-conflict variant) and the heavier "reposition the alternate via last-returned key" approach WT-17960 discusses.
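
      Which cursor states the guard actually changes can be tabulated with a small model. The function below is a hypothetical Python mirror of the read-timestamp branch, not the C source: positioned stands in for WT_CURSTD_KEY_INT being set, on_stable for current_cursor == stable_cursor, and guarded selects between the current and the proposed behaviour.

```python
from itertools import product


def read_ts_branch_can_advance(positioned, on_stable, guarded):
    """Hypothetical mirror of the read-timestamp branch.

    guarded=False models the current behaviour (always advance);
    guarded=True models the proposed positioned-on-stable check.
    """
    if guarded and positioned and on_stable:
        return False  # parked on stable mid-walk: stay on this checkpoint
    return True


# Collect the (positioned, on_stable) states whose answer the guard changes.
changed = [
    (positioned, on_stable)
    for positioned, on_stable in product([False, True], repeat=2)
    if read_ts_branch_can_advance(positioned, on_stable, guarded=False)
    != read_ts_branch_can_advance(positioned, on_stable, guarded=True)
]
print(changed)
# prints: [(True, True)]
```

      In this model only a cursor parked on the stable constituent is affected; an unpositioned cursor, or one parked on ingest, still advances to the newer checkpoint.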

      Impact

      Hard abort of a disaggregated-storage follower node under a common pattern: reads (majority/nearest reads) running concurrently with deletes and a checkpoint pickup. Observed reliably on the sys-perf tpcc_majority_out_of_cache DSC workload. This is the general read-timestamp/checkpoint case that WT-17960 (a spike) flags as unaddressed. WT-17899 fixes only the prepare-conflict path and does not cover it.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Haribabu Kommi
            Votes:
            0
            Watchers:
            2
