Standby crashes with invariantWTOK when checkpoint pick-up installs a pre-rotation checkpoint after KEK rotation oplog entry is already applied


    • Server Security
    • ALL

      Run pali_chaos_kek.js (the SERVER-129742 chaos test that exercises KEK rotation concurrently with SLS fault injection). The crash is non-deterministic but occurs reliably within one 10-minute chaos run when KEK rotation fires while a checkpoint is in-flight on the primary.

      python3 buildscripts/resmoke.py run --suites=disagg_pali_chaos \
          src/mongo/db/modules/atlas/jstests/pali_chaos/pali_chaos_kek.js  

      The attached log pali.log.txt shows one instance of running the JS test on a local Ubuntu 22.04 build machine. 

      Alternatively, once SERVER-129742 is merged, run the task manually with an Evergreen patch if the test is not yet wired to run automatically:

      evergreen patch -p mongodb-mongo-dsc-release-master \
        --variants atlas-amazon2023-arm64 \
        --tasks disagg_pali_chaos_kek 
    • Server Security 2026-07-03

      pali.log.txt

      The standby node crashes with a fatal invariant failure when two conditions coincide:

      1. A KEK rotation oplog entry has been applied, advancing the standby's in-memory keystore to KEK v2.
      2. The checkpoint pick-up thread simultaneously tries to install a checkpoint that was taken on the primary after the in-memory keystore was updated but before the updated keystore was flushed to WiredTiger. That checkpoint therefore contains the v1 keystore LSN.

      WiredTiger detects the stale keystore reference and returns EINVAL. The invariantWTOK wrapper in wiredtiger_kv_engine.cpp:2435 turns this into an unconditional abort().

      Observed Behavior

      The standby process aborts. The chaos controller detects the dead standby and reports:

      [MONGOD BUG] Standby unreachable after rapid_seal_burst
      [MONGOD BUG] validate: Standby unreachable for validate
      [MONGOD BUG] dbHash: Standby unreachable for dbHash
      [PERF] standby UNREACHABLE for 24/41 steady-state samples 

      Log Evidence

       
      From the attached standby log (port 20061), in chronological order:

      01:11:38.817  -- First KEK rotation completes on primary (KEK 1 → 2); oplog entry applied on standby; in-memory keystore now at LSN 7655505607113310340
      
      01:11:43.883  id:40414  Failed to parse KEK Keystore from WiredTiger: "loadFromWT: persisted keystore timestamp 7655505345120305155 is behind last committed timestamp 7655505607113310340"
      
      01:11:43.886  id:11722321  loadKey: Failed to load keys from WT 
          WT: Failed to pick up disaggregated storage checkpoint for
              metadata_lsn=7655505607113310321: ret=22
          WT: int __wti_disagg_load_crypt_key: key_provider->load_key failed
          WT: int __disagg_pick_up_checkpoint: __wti_disagg_load_crypt_key failed
          WT: Error at conn/conn_reconfig.c:449: "__wti_disagg_conn_config(session, cfg, true)" failed: EINVAL (22)
      
      01:11:43.892  id:23083  Invariant failure: "_conn->reconfigure(_conn, getCkptMetaConfigString.c_str())" error "BadValue: 22: Invalid argument"
         id:23084  aborting after invariant() failure
         id:6384300  Got signal: 6 (Aborted) 

      Root Cause

      The primary takes a checkpoint at time T. At that moment:

      • The primary's in-memory keystore has already been updated to KEK v2 (oplog write complete).
      • The WiredTiger keystore on-disk has not yet been updated (it's updated at the next checkpoint).

      So the checkpoint at time T embeds the v1 keystore LSN (7655505345120305155) even though the primary's in-memory state is v2.

      When the standby:

      1. Applies the KEK rotation oplog entry → its in-memory keystore advances to v2 (LSN 7655505607113310340).
      2. Picks up the checkpoint from time T via setRecoveryCheckpointMetadata → calls _conn->reconfigure() with metadata_lsn pointing to a checkpoint whose embedded keystore is at v1.

      WiredTiger's __wti_disagg_load_crypt_key fails because key_provider->load_key refuses to load a keystore whose persisted timestamp is behind the last committed in-memory timestamp, so the reconfigure call returns EINVAL. This is the correct behavior from WT's perspective: the keystore appears to have gone backwards.
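
The rejection described above can be modeled in a few lines. This is only an illustrative sketch: StandbyKeystore, Checkpoint, applyKekRotation and loadKeystoreFromCheckpoint are hypothetical names, not the server's or WiredTiger's actual types, and the comparison is inferred from the "persisted keystore timestamp ... is behind last committed timestamp" log message.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical stand-in for the standby's in-memory keystore state.
struct StandbyKeystore {
    std::uint64_t lastCommittedTs;  // advanced by KEK rotation oplog entries
};

// Hypothetical stand-in for a checkpoint published by the primary.
struct Checkpoint {
    std::uint64_t persistedKeystoreTs;  // keystore timestamp embedded in the checkpoint
};

// Applying a KEK rotation oplog entry moves the in-memory timestamp forward.
void applyKekRotation(StandbyKeystore& standby, std::uint64_t rotationTs) {
    standby.lastCommittedTs = std::max(standby.lastCommittedTs, rotationTs);
}

// Assumed shape of the staleness check: a checkpoint whose embedded keystore
// is behind the in-memory state is rejected with EINVAL (22).
int loadKeystoreFromCheckpoint(const StandbyKeystore& standby, const Checkpoint& ckpt) {
    return ckpt.persistedKeystoreTs < standby.lastCommittedTs ? EINVAL : 0;
}
```

With the values from the log, a standby that has applied the rotation (last committed 7655505607113310340) rejects the checkpoint from time T (persisted 7655505345120305155), while the same checkpoint would have loaded cleanly before the oplog entry was applied.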

      The bug is that invariantWTOK in WiredTigerKVEngine::setRecoveryCheckpointMetadata (wiredtiger_kv_engine.cpp:2435) treats this recoverable error as fatal:

      invariantWTOK(_conn->reconfigure(_conn, getCkptMetaConfigString.c_str()), nullptr); 

      Expected Behavior

      The standby should not crash. Options to fix (least invasive first):

      1. Preferred — skip stale checkpoints: When _conn->reconfigure() returns EINVAL with a stale-keystore diagnostic, log a warning and skip that checkpoint. The standby will pick up the next checkpoint, which will have been taken after the WiredTiger keystore flush and will be consistent.
      2. Defer checkpoint pick-up until keystore flush: On the primary, do not make a checkpoint visible for standby pick-up until the WiredTiger keystore has been flushed to reflect the current KEK.
      3. Tolerate backward LSN on standby during oplog replay: When applying a KEK rotation oplog entry, also advance the WiredTiger keystore LSN floor so that the stale check in __wti_disagg_load_crypt_key does not reject it.
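
A rough sketch of option 1 follows, under stated assumptions. ReconfigureOutcome, PickUpResult and tryInstallCheckpoint are hypothetical names introduced for illustration. The `reconfigure` callback stands in for the `_conn->reconfigure(...)` call, and the `staleKeystore` flag stands in for whatever signal the key provider would expose to distinguish the stale-keystore case from other EINVAL causes; no such signal is confirmed to exist today.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical result of the reconfigure call plus a stale-keystore marker.
struct ReconfigureOutcome {
    int ret;             // WiredTiger-style return code, 0 on success
    bool staleKeystore;  // assumed signal that load_key rejected a stale keystore
};

// Hypothetical outcome type; real code would more likely return a Status.
enum class PickUpResult { kInstalled, kSkippedStale, kFatal };

// Installs a checkpoint, downgrading only the stale-keystore EINVAL to a
// warning. Every other failure is still reported as fatal to the caller.
PickUpResult tryInstallCheckpoint(
    const std::function<ReconfigureOutcome(const std::string&)>& reconfigure,
    const std::string& ckptMetaConfig) {
    const ReconfigureOutcome outcome = reconfigure(ckptMetaConfig);
    if (outcome.ret == 0) {
        return PickUpResult::kInstalled;
    }
    if (outcome.ret == EINVAL && outcome.staleKeystore) {
        std::cerr << "warning: skipping checkpoint with stale keystore: "
                  << ckptMetaConfig << "\n";
        return PickUpResult::kSkippedStale;  // wait for the next checkpoint
    }
    return PickUpResult::kFatal;  // caller may still invariant on this
}
```

Keeping kFatal separate preserves the existing invariant for unexpected reconfigure failures, so only the known race is downgraded to a skip.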

         

      Additional Context

      • This crash cannot be triggered by pali_chaos.js because that test uses a static key file with no dynamic keystore updates.
      • The rapid_seal_burst event reported in the test output is coincidental — it was in-progress when the chaos controller's liveness probe detected the already-dead standby (~30 seconds after the actual crash).
      • The rotation driver confirmed 2 successful KEK rotations (kekAttempted=2, kekCompleted=2) before the crash, so the driver is correctly exercising the code path.

        1. pali.log.txt
          92.11 MB
          Chye Lin Chee

            Assignee:
            Gabriel Marks
            Reporter:
            Chye Lin Chee