Standby fatal abort when checkpoint pick-up encounters stale keystore LSN after KEK rotation

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Server Security
    • ALL
    • Steps to Reproduce:
      1. Build the server with featureFlagKEKPushMode and featureFlagCMKRotation enabled.
      2. Run: python3 buildscripts/resmoke.py run --suites=disagg_pali_chaos \
             src/mongo/db/modules/atlas/jstests/pali_chaos/pali_chaos_kek.js
      3. Wait for the rotation driver thread to fire eseRotateActiveKEK concurrently with a kill_standby_page_server or rapid_seal_burst chaos event.
      4. Observe standby SIGABRT at wiredtiger_kv_engine.cpp:2435.

      Seed 1234 reproduces run 2 within ≈5 minutes of chaos phase start.


      Attached log: palichaoskek2.log.txt (the test was run in push mode).

      A disaggregated-storage standby aborts whenever WiredTiger's checkpoint pick-up thread tries to install a checkpoint whose embedded keystore LSN is stale relative to the standby's in-memory committed keystore. This can happen when a checkpoint is taken while a KEK rotation is in flight.

      The crash site is `setRecoveryCheckpointMetadata` in wiredtiger_kv_engine.cpp, which calls `invariantWTOK(_conn->reconfigure(...))`. When the disagg checkpoint pick-up returns EINVAL(22) for a keystore LSN mismatch, `invariantWTOK` converts the error to a fatal abort with no recovery path. The server should instead treat this as a retriable error or perform a clean step-down.
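
      One way to picture the proposed change is to classify the return code instead of asserting on it. The sketch below is a simplified stand-alone model, not the server's code: `PickUpOutcome` and `classifyPickUpResult` are hypothetical names, and it assumes that EINVAL on this path can be attributed to a stale keystore LSN, which a real fix would need to confirm with a narrower signal from WiredTiger.

```cpp
#include <cerrno>

// Hypothetical outcome type for illustration; it does not exist in the server.
enum class PickUpOutcome {
    kInstalled,      // reconfigure returned 0
    kStaleKeystore,  // EINVAL: skip this checkpoint or step down cleanly
    kFatal           // anything else keeps today's fatal behavior
};

// Maps a WiredTiger-style integer return code to an outcome.
// Assumption: EINVAL on this path means "stale keystore LSN". EINVAL can have
// other causes, so this mapping is too coarse for production use as written.
PickUpOutcome classifyPickUpResult(int wtRet) {
    if (wtRet == 0) {
        return PickUpOutcome::kInstalled;
    }
    if (wtRet == EINVAL) {
        return PickUpOutcome::kStaleKeystore;
    }
    return PickUpOutcome::kFatal;
}
```

      With a mapping like this, only `kFatal` would still reach an invariant; `kStaleKeystore` would be routed to whichever recovery path is chosen.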

      The bug has been reproduced twice in pali_chaos_kek.js under two different chaos events, which indicates a real race rather than test flakiness. Any chaos event that triggers a checkpoint pick-up while a KEK rotation is in flight can hit it:

      • Run 1 trigger: rapid_seal_burst. A seal burst prompted a checkpoint that landed in the KEK rotation race window.
      • Run 2 trigger: kill_standby_page_server. Killing the page servers prompted an immediate checkpoint pick-up attempt that landed in the same race window (≈2 s after KEK v3 rotation completed).

      In both runs the crash site, error code, and WiredTiger call stack are identical.

      Race condition

      The race window exists on the primary between two events:

      • T1: MongoDB layer commits KEK vN to its in-memory keystore and reserves an oplog timestamp for the push-mode keystore write (log ID 12623630).
      • T2: The keystore is flushed to WiredTiger at that reserved oplog timestamp.

      A checkpoint snapshotted between T1 and T2 has the vN-1 keystore LSN embedded in it.

      On the standby:

      1. The standby applies the KEK vN rotation oplog entry, advancing its in-memory keystore to vN.
      2. The standby's checkpoint pick-up thread (WiredTiger session "checkpoint-pick-up") tries to install a checkpoint that was snapshotted in the T1–T2 window (keystore LSN still at vN-1).
      3. `__wti_disagg_load_crypt_key` (conn_layered_page_log.c:295) detects the mismatch: checkpoint's embedded keystore LSN < in-memory committed keystore version → EINVAL(22).
      4. EINVAL propagates: __disagg_pick_up_checkpoint (943) → __wti_disagg_pick_up_checkpoint_meta (1093) → __wti_disagg_conn_config (conn_layered.c:1235) → conn_reconfig.c:449.
      5. `invariantWTOK` at wiredtiger_kv_engine.cpp:2435 converts EINVAL to a fatal abort.
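
      The sequence above can be modeled in a few lines. This is a toy model under stated assumptions, not WiredTiger code: `PrimaryModel` and `loadCryptKeyModel` are hypothetical, and keystore LSNs are reduced to small KEK version numbers (2 and 3) so that the comparison in step 3 is easy to follow.

```cpp
#include <cerrno>
#include <cstdint>

// Toy primary: T1 advances the in-memory keystore, T2 flushes it.
struct PrimaryModel {
    std::uint64_t inMemoryKeystore = 2;  // vN-1 before the rotation
    std::uint64_t flushedKeystore = 2;   // what a checkpoint would embed

    void commitRotation() { ++inMemoryKeystore; }                 // T1
    void flushKeystore() { flushedKeystore = inMemoryKeystore; }  // T2

    // A checkpoint embeds whatever keystore has been flushed so far.
    std::uint64_t snapshotCheckpoint() const { return flushedKeystore; }
};

// Mirrors the comparison described for __wti_disagg_load_crypt_key:
// a checkpoint keystore older than the standby's committed one is rejected.
int loadCryptKeyModel(std::uint64_t checkpointKeystore, std::uint64_t standbyCommitted) {
    return checkpointKeystore < standbyCommitted ? EINVAL : 0;
}
```

      In this model, a checkpoint taken after `commitRotation()` but before `flushKeystore()` carries 2. A standby that has already applied the rotation (3) rejects it with EINVAL, while a checkpoint taken after the flush is accepted.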

      Crash evidence (run 2)

      Log file: palichaoskek2.log.txt (attached)

      Chaos event: kill_standby_page_server (evt-008), injected ≈17:22:35 UTC
      KEK v3 rotation completed on primary: 17:22:33.330 UTC

      Relevant log lines (standby port 20061):

      17:22:35.791  WiredTiger ERROR (error=0, ctx=Disagg-5, session=checkpoint-pick-up): "Failed to pick up disaggregated storage checkpoint for metadata_lsn=7655755784663335110: ret=22"
      
      17:22:35.792  WiredTiger ERROR (error=22, ctx=Disagg-5, session=WT_CONNECTION.reconfigure): "int __wti_disagg_conn_config(...):1235: Failed to pick up a new checkpoint with config: metadata_lsn=7655755784663335110, metadata_checksum=4f8ebbd8, database_size=9993408, version=1, compatible_version=1  error_str=Invalid argument  error_code=22"
      
      17:22:35.799  CheckpointManager::_updateCheckpointIfAvailable (ctx=Disagg-3):
        latestMaterializedAndAppliedCheckpointLsn: 7655755810433138789
        _installedCheckpointLsn:                   7655755763188498436
        _checkpointToInstallLsn:                   7655755784663335113  ← the failing checkpoint
      
      17:22:35.836  MONGOD ABORT: "Invariant failure: \"_conn->reconfigure(_conn, getCkptMetaConfigString.c_str())\"error \"BadValue: 22: Invalid argument\"" at wiredtiger_kv_engine.cpp:2435 Got signal: 6 (Aborted) 
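
      The three LSNs in the CheckpointManager line can be compared directly. The sketch below only does that arithmetic; the constant and function names are hypothetical, and reading `latestMaterializedAndAppliedCheckpointLsn` as "a newer checkpoint was already available" is an interpretation of the field name, not something the log states. The WiredTiger messages print metadata_lsn ending in ...110 while CheckpointManager prints ...113; the sketch uses the CheckpointManager values.

```cpp
#include <cstdint>

// Values copied from the 17:22:35.799 CheckpointManager log line.
constexpr std::uint64_t kInstalledLsn = 7655755763188498436ULL;
constexpr std::uint64_t kToInstallLsn = 7655755784663335113ULL;
constexpr std::uint64_t kLatestMaterializedLsn = 7655755810433138789ULL;

// True when the failing checkpoint is newer than the installed one and an
// even newer LSN is already reported as materialized and applied.
constexpr bool newerCheckpointExists(std::uint64_t installed,
                                     std::uint64_t toInstall,
                                     std::uint64_t latest) {
    return installed < toInstall && toInstall < latest;
}

static_assert(newerCheckpointExists(kInstalledLsn, kToInstallLsn, kLatestMaterializedLsn),
              "the log shows a newer LSN than the failing checkpoint");
```

      If that interpretation holds, the standby had a later LSN to move to at the moment it aborted.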

      Note on WT_NOTFOUND errors: a flood of WT_NOTFOUND (-31803) errors from __cursor_row_next, __curfile_next, __wt_meta_checkpoint_last_name, and __layered_last_checkpoint_order:1233 appears AFTER the EINVAL. These are secondary: they occur during WiredTiger's cleanup/error-unwind path for the failed WT_CONNECTION.reconfigure session, at which point the page servers (just killed by the chaos event) make checkpoint metadata table lookups return WT_NOTFOUND. They do not propagate to invariantWTOK and are not the root cause.

      Expected behavior

      `setRecoveryCheckpointMetadata` should handle EINVAL from a stale keystore LSN without calling `invariantWTOK`. The correct behavior is one of:

      • Skip the stale checkpoint and wait for a newer checkpoint whose keystore LSN is compatible with the current in-memory keystore, or
      • Perform a clean step-down and let the standby re-sync rather than aborting.

      In neither case should a keystore LSN mismatch during checkpoint pick-up be a fatal server error.
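
      The first option, skipping the stale checkpoint, could look roughly like the selection below. It is a sketch under assumptions rather than a proposed patch: `CheckpointMeta` and `chooseInstallable` are hypothetical, and it assumes the standby can see each candidate's embedded keystore LSN before asking WiredTiger to install it, which this report does not establish.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical view of a checkpoint offered to the standby.
struct CheckpointMeta {
    std::uint64_t metadataLsn;  // identifies the checkpoint
    std::uint64_t keystoreLsn;  // keystore LSN embedded at snapshot time
};

// Returns the newest candidate whose keystore LSN is not behind the standby's
// committed keystore, or nullptr to mean "install nothing and wait".
// The returned pointer refers into `candidates` and is valid only while that
// vector is alive and unmodified.
const CheckpointMeta* chooseInstallable(const std::vector<CheckpointMeta>& candidates,
                                        std::uint64_t committedKeystoreLsn) {
    const CheckpointMeta* best = nullptr;
    for (const CheckpointMeta& c : candidates) {
        if (c.keystoreLsn < committedKeystoreLsn) {
            continue;  // stale: this is the case that fails with EINVAL today
        }
        if (best == nullptr || c.metadataLsn > best->metadataLsn) {
            best = &c;
        }
    }
    return best;
}
```

      Returning nullptr instead of an error keeps the "wait for a newer checkpoint" case distinct from a real failure, so the caller can simply retry on the next pick-up cycle.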

      Affected versions

      Observed on: current master (branch chyelin/SERVER-129742-chaos, 2026-06-26)
      Feature flags: featureFlagKEKPushMode (default false, enabled in test),
                     featureFlagCMKRotation (default true, FCV 9.0)

      See also

      Related test: src/mongo/db/modules/atlas/jstests/pali_chaos/pali_chaos_kek.js
      Crash site:   src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:2435
                    (setRecoveryCheckpointMetadata, invariantWTOK(_conn->reconfigure(...)))

       

      Attachments: palichaoskek2.log.txt (12 kB, Chye Lin Chee)

      Assignee: Unassigned
      Reporter: Chye Lin Chee