PALI Chaos Test: coverage for dynamic KEK generation, KEK rotation, and CMK rotation

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Server Security
    • Server Security 2026-07-03

      pali_chaos.js tests SLS infrastructure resilience (log leader kills, seals, zone outages, etc.) with a static KEK. It has no coverage of the encryption key-management layer: fresh dynamic-KEK cluster initialization, eseRotateActiveKEK, and updateESECMKIdentifierList are never exercised under real SLS chaos. This ticket adds pali_chaos_kek.js alongside pali_chaos.js in src/mongo/db/modules/atlas/jstests/pali_chaos/. The new file reuses the full existing infrastructure (chaos_controller, chaos_workload, chaos_metrics, chaos_checks, chaos_report, all event types) and differs from pali_chaos.js in two respects only:

      • the cluster is started with PyKMIP-backed dynamic KEK instead of a static key file, and
      • a rotation driver thread fires KEK and CMK rotations concurrently with the running chaos phase.

      The existing disagg_storage/encryption JS tests cover rotation correctness in isolation; this test covers durability of rotation under the same faults that disaggregated storage faces in production.

      Work Items

      1. Create pali_chaos_kek.js in src/mongo/db/modules/atlas/jstests/pali_chaos/. Imports:
        • All existing pali_chaos modules (chaos_controller, chaos_workload, chaos_metrics, chaos_checks, chaos_report, chaos_config).
        • startPyKMIPServer, killPyKMIPServer, createPyKMIPKey, activatePyKMIPKey
        • SLSMinimalThreeCellTest and constants from sls_minimal_three_cell_test.js (same topology as pali_chaos.js, including the second log-server set for cross-zone seals).
      2. Start PyKMIP before cluster initialization. Create and activate a KMIP key. Record the UUID.
      3. Replace the encryptionKeyFilePath field in disaggConfig with KMIP mongod options (kmipPort, kmipServerName, kmipServerCAFile, kmipClientCertificateFile) and include initialDisaggESECMKIdentifierList as a setParameter. Mirror the patterns in SLSEncryptionTest.makeKmipMongodOptions() and makeStandardSetParameters(). Also set featureFlagKEKPushMode: true and featureFlagCMKRotation: true in setParameters so both rotation paths are exercised. Apply the same KMIP options to the standby mongod.
      4. Create a rotation driver thread (chaosRotationDriverFn) that runs concurrently with the chaos phase (from warmup start to stopLatch). The driver loop:
        • Sleep a random interval (rotationIntervalMinMs to rotationIntervalMaxMs, suggested defaults: 20 000 ms and 60 000 ms, exported from chaos_config.js).
        • Attempt eseRotateActiveKEK on the current primary. Tolerate transient errors (NotWritablePrimary, ConflictingOperationInProgress, NetworkError, etc.) and retry on the next interval; do not fail the run on a single failed attempt.
        • If the command succeeded, poll getESERotateActiveKEKStatus until the rotation reaches a terminal status (completed or failure) or the stopLatch is counted down, whichever comes first. Record the outcome (completed count, failure count) for post-chaos reporting.
        • After each KEK rotation attempt (successful or not), attempt updateESECMKIdentifierList with a two-entry CMK list (the original UUID plus a second key created at startup). Apply the same transient-error tolerance as for the eseRotateActiveKEK attempt. Poll getESECMKIdentifierListStatus until it reaches a terminal status or the stopLatch is counted down, and record the outcome in the same way.
      5. The driver must track the primary port itself (re-discover via hello after failover) rather than relying on a fixed port, consistent with how the workload workers handle failover.
      6. Collect and join the rotation driver thread after the chaos phase ends (same pattern as the workload threads and lagThread). Surface KEK rotation counts (completed/failure/attempted) and CMK rotation counts in the PALI report by adding them to verdict.summary and printing them via printReport.
      7. Add post-chaos encryption validation using safelyRun, invoked after the existing dbHash check:
        • Connect to the current primary. Poll getESERotateActiveKEKStatus until the status is not pending (max 30 s). Fail with [MONGOD BUG] if it is still pending after the timeout.
        • Poll getESECMKIdentifierListStatus until rotationStatus.status is not pending (max 30 s). Fail with [MONGOD BUG] if still pending.
        • Write one document and read it back. Failure here is [DATA BUG] (encryption broken).
      8. Add a minimum-activity gate: if the rotation driver completed zero KEK rotations AND zero CMK rotations, push [FRAMEWORK] insufficient signal: rotation driver made no progress to verdict.failures. This prevents a silent pass when KMIP is unreachable for the entire run.
      9. Wrap PyKMIP teardown (killPyKMIPServer) in the cleanup block alongside sls.cleanup(). A throw from killPyKMIPServer should be caught and appended to cleanupErrors, not allowed to swallow the verdict assertion.
      10. Confirm that pali_chaos_kek.js is picked up by src/mongo/db/modules/atlas/jstests/pali_chaos/BUILD.bazel. The all_javascript_files glob already matches *.js, so no explicit entry should be needed; add one only if the glob turns out not to cover the new file.
      11. Export two new constants from chaos_config.js:

            export const rotationIntervalMinMs = 20000;
            export const rotationIntervalMaxMs = 60000;
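
The driver loop from work item 4 could be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation: the deps object (startRotation, pollStatus, sleep, shouldStop, rand, pollMs), the counter shape, and the idea that pollStatus returns a plain status string are all hypothetical, and the transient-error set only mirrors the examples named in this ticket. Rethrowing non-transient errors is a choice made for the sketch.

```javascript
// Sketch only: helper names and the deps object are hypothetical.
// The error list mirrors the examples in this ticket, not a complete server list.
const TRANSIENT_ERRORS = new Set([
    "NotWritablePrimary",
    "ConflictingOperationInProgress",
    "NetworkError",
]);

function isTransientError(err) {
    return Boolean(err) && TRANSIENT_ERRORS.has(err.codeName);
}

// Uniform integer in [minMs, maxMs]; rand is injectable so tests are deterministic.
function pickIntervalMs(minMs, maxMs, rand = Math.random) {
    return minMs + Math.floor(rand() * (maxMs - minMs + 1));
}

// One rotation attempt: issue the command, then poll until terminal or stop.
function attemptRotation(startCmd, statusCmd, counters, deps) {
    counters.attempted++;
    try {
        deps.startRotation(startCmd);
    } catch (err) {
        if (!isTransientError(err)) throw err;
        counters.transient++;
        return; // try again on the next interval
    }
    while (!deps.shouldStop()) {
        let status = "pending";
        try {
            status = deps.pollStatus(statusCmd);
        } catch (err) {
            // A failover during polling is expected under chaos; keep polling.
            if (!isTransientError(err)) throw err;
        }
        if (status === "completed" || status === "failure") {
            counters[status]++;
            return;
        }
        deps.sleep(deps.pollMs);
    }
}

function runRotationDriver(cfg, deps) {
    const newCounters = () => ({attempted: 0, completed: 0, failure: 0, transient: 0});
    const stats = {kek: newCounters(), cmk: newCounters()};
    while (!deps.shouldStop()) {
        deps.sleep(pickIntervalMs(cfg.rotationIntervalMinMs, cfg.rotationIntervalMaxMs, deps.rand));
        if (deps.shouldStop()) break;
        attemptRotation("eseRotateActiveKEK", "getESERotateActiveKEKStatus", stats.kek, deps);
        if (deps.shouldStop()) break;
        // The CMK attempt follows every KEK attempt, successful or not.
        attemptRotation("updateESECMKIdentifierList", "getESECMKIdentifierListStatus", stats.cmk, deps);
    }
    return stats;
}
```

In the real test, startRotation and pollStatus would wrap commands sent to the re-discovered primary (work item 5), and shouldStop would reflect the stopLatch. Injecting them keeps the loop testable without a cluster.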

      Acceptance Criteria

      The acceptance criteria for merging this ticket are limited to test correctness. If a full run exposes server-side bugs, each bug should be filed as a separate JIRA ticket and the test should be merged regardless. The test must not be held back waiting for those bugs to be fixed; it is a diagnostic tool, and other team members need access to it to reproduce and track the failures.

      Required to merge

      • pali_chaos_kek.js starts and runs without framework errors: the cluster comes up with dynamic KEK, PyKMIP initializes, the rotation driver thread launches, and the workload threads begin producing operations.
      • The rotation driver fires at least one eseRotateActiveKEK attempt and one updateESECMKIdentifierList attempt during the chaos phase, confirming the driver loop is reachable (minimum-activity gate is not vacuous).
      • The minimum-activity gate fires a [FRAMEWORK] failure if the rotation driver is intentionally disabled (e.g., by setting rotationIntervalMinMs above chaosDurationMs), confirming the gate itself works correctly.
      • PyKMIP is cleanly torn down after the run even when the primary was killed at the end of the chaos phase (cleanup block does not throw past the verdict assertion).
      • The test can complete a full run (warmup + chaos + cooldown) without the test framework itself crashing (workload threads, lag monitor, and rotation driver all join cleanly).
      • The test can run manually in Evergreen.
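
The minimum-activity gate and the error-collecting cleanup block that these criteria rely on (work items 8 and 9) might look something like the sketch below. The verdict and stats shapes, and the function names applyMinimumActivityGate and runCleanup, are assumptions made for illustration; only the failure message and the cleanupErrors name come from this ticket.

```javascript
// Sketch only: object shapes and function names are hypothetical.

// Push a [FRAMEWORK] failure when no KEK and no CMK rotation completed,
// so an unreachable KMIP server cannot produce a silent pass.
function applyMinimumActivityGate(verdict, stats) {
    if (stats.kek.completed === 0 && stats.cmk.completed === 0) {
        verdict.failures.push(
            "[FRAMEWORK] insufficient signal: rotation driver made no progress");
    }
    return verdict;
}

// Run every cleanup step even if an earlier one throws, collecting the errors
// instead of letting them mask the verdict assertion that follows.
function runCleanup(steps) {
    const cleanupErrors = [];
    for (const [name, fn] of steps) {
        try {
            fn();
        } catch (err) {
            const detail = err && err.message ? err.message : String(err);
            cleanupErrors.push(`${name}: ${detail}`);
        }
    }
    return cleanupErrors;
}
```

A caller would pass pairs such as ["killPyKMIPServer", ...] and ["sls.cleanup", ...] to runCleanup, then report cleanupErrors separately from verdict.failures.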

      Expected behavior when the server has no bugs (not required to merge)

      • At least one KEK rotation and one CMK rotation reach completed status in verdict.summary.
      • When kill_primary_mongod fires during an in-progress KEK rotation, the rotation driver tolerates the transient error and the post-chaos polling check confirms the rotation reaches a terminal status after the new primary steps up.
      • When seal_and_kill_primary fires during an in-progress CMK rotation, the cluster recovers, the rotation driver resumes on the new primary, and the post-chaos CMK status poll succeeds.
      • Standard pali_chaos checks (validate, dbHash, acknowledged writes, standby lag) all pass, confirming the rotation driver does not corrupt the data integrity guarantees that pali_chaos.js already verifies.
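
The post-chaos status polls from work item 7 (wait until the status is no longer pending, at most 30 s, otherwise report [MONGOD BUG]) can be sketched as a bounded poll. The helper names, the 500 ms poll interval, and the assumption that getStatus returns a plain status string are hypothetical; the clock and sleep functions are injected so the sketch runs without a cluster.

```javascript
// Sketch only: names and the injected clock are hypothetical.

// Poll getStatus until it returns something other than "pending" or the
// timeout elapses. Returns the last observed status either way.
function pollUntilNotPending(getStatus, opts) {
    const deadline = opts.now() + opts.timeoutMs;
    while (true) {
        const status = getStatus();
        if (status !== "pending") return {ok: true, status};
        if (opts.now() >= deadline) return {ok: false, status};
        opts.sleep(opts.intervalMs);
    }
}

// Record a [MONGOD BUG] failure when a rotation never leaves "pending".
function checkRotationSettled(label, getStatus, opts, failures) {
    const result = pollUntilNotPending(getStatus, opts);
    if (!result.ok) {
        failures.push(`[MONGOD BUG] ${label} still pending after ${opts.timeoutMs} ms`);
    }
    return result;
}
```

The same helper would serve both getESERotateActiveKEKStatus and getESECMKIdentifierListStatus, with getStatus extracting the relevant status field from each response.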

      Notes

      PyKMIP is already used in the disagg_storage/encryption JS tests (tag: uses_pykmip). The new test should carry the same tags: [uses_pykmip, incompatible_with_s390x, incompatible_with_windows_tls, resource_intensive].

      This ticket does not include wiring the test to run automatically in Evergreen. To avoid a prolonged period of unresolved Evergreen failures, the test should be run manually until it is reasonably stable and the first bugs it exposes have been found and fixed. The follow-up ticket SERVER-129966 will add the automatic Evergreen wiring. To run the test manually:

      evergreen patch -p mongodb-mongo-dsc-release-master \
        --variants atlas-amazon2023-arm64 \
        --tasks disagg_pali_chaos_kek 

            Assignee:
            Chye Lin Chee
            Reporter:
            Chye Lin Chee