-
Type: Task
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Server Security
-
Server Security 2026-07-03
pali_chaos.js tests SLS infrastructure resilience (log leader kills, seals, zone outages, etc.) with a static KEK. It has no coverage of the encryption key-management layer: fresh dynamic-KEK cluster initialization, eseRotateActiveKEK, and updateESECMKIdentifierList are never exercised under real SLS chaos. This ticket adds pali_chaos_kek.js alongside pali_chaos.js in src/mongo/db/modules/atlas/jstests/pali_chaos/. The new file reuses the full existing infrastructure (chaos_controller, chaos_workload, chaos_metrics, chaos_checks, chaos_report, all event types) and differs from pali_chaos.js in two respects only:
- the cluster is started with a PyKMIP-backed dynamic KEK instead of a static key file, and
- a rotation driver thread fires KEK and CMK rotations concurrently with the running chaos phase.
The existing disagg_storage/encryption JS tests cover rotation correctness in isolation; this test covers durability of rotation under the same faults that disaggregated storage faces in production.
Work Items
- Create pali_chaos_kek.js in src/mongo/db/modules/atlas/jstests/pali_chaos/. Imports:
  - All existing pali_chaos modules (chaos_controller, chaos_workload, chaos_metrics, chaos_checks, chaos_report, chaos_config).
  - startPyKMIPServer, killPyKMIPServer, createPyKMIPKey, activatePyKMIPKey
  - SLSMinimalThreeCellTest and constants from sls_minimal_three_cell_test.js (same topology as pali_chaos.js, including the second log-server set for cross-zone seals).
- Start PyKMIP before cluster initialization. Create and activate a KMIP key. Record the UUID.
- Replace the encryptionKeyFilePath field in disaggConfig with KMIP mongod options (kmipPort, kmipServerName, kmipServerCAFile, kmipClientCertificateFile) and include initialDisaggESECMKIdentifierList as a setParameter. Mirror the patterns in SLSEncryptionTest.makeKmipMongodOptions() and makeStandardSetParameters(). Also set featureFlagKEKPushMode: true and featureFlagCMKRotation: true in setParameters so both rotation paths are exercised. Apply the same KMIP options to the standby mongod.
- Create a rotation driver thread (chaosRotationDriverFn) that runs concurrently with the chaos phase (from warmup start to stopLatch). The driver loop:
  - Sleep a random interval (rotationIntervalMinMs to rotationIntervalMaxMs, suggested defaults: 20 000 ms and 60 000 ms, exported from chaos_config.js).
  - Attempt eseRotateActiveKEK on the current primary. Tolerate transient errors (NotWritablePrimary, ConflictingOperationInProgress, NetworkError, etc.) and retry on the next interval; do not fail the run on a single failed attempt.
  - If the command succeeded, poll getESERotateActiveKEKStatus until the rotation reaches a terminal status (completed or failure) or the stopLatch is counted down, whichever comes first. Record the outcome (completed count, failure count) for post-chaos reporting.
  - After each KEK rotation attempt (successful or not), attempt updateESECMKIdentifierList with a two-entry CMK list (the original UUID plus a second key created at startup). Apply the same transient-error tolerance as for the eseRotateActiveKEK attempt. Poll getESECMKIdentifierListStatus until it reaches a terminal status or the stopLatch is counted down.
- The driver must track the primary port itself (re-discover via hello after failover) rather than relying on a fixed port, consistent with how the workload workers handle failover.
- Collect and join the rotation driver thread after the chaos phase ends (same pattern as the workload threads and lagThread). Surface KEK rotation counts (completed/failure/attempted) and CMK rotation counts in the PALI report by adding them to verdict.summary and printing them via printReport.
- Add post-chaos encryption validation using safelyRun, invoked after the existing dbHash check:
  - Connect to the current primary. Poll getESERotateActiveKEKStatus until the status is not pending (max 30 s). Fail with [MONGOD BUG] if it is still pending after the timeout.
  - Poll getESECMKIdentifierListStatus until rotationStatus.status is not pending (max 30 s). Fail with [MONGOD BUG] if still pending.
  - Write one document and read it back. Failure here is [DATA BUG] (encryption broken).
- Add a minimum-activity gate: if the rotation driver completed zero KEK rotations AND zero CMK rotations, push [FRAMEWORK] insufficient signal: rotation driver made no progress to verdict.failures. This prevents a silent pass when KMIP is unreachable for the entire run.
- Wrap PyKMIP teardown (killPyKMIPServer) in the cleanup block alongside sls.cleanup(). A throw from killPyKMIPServer should be caught and appended to cleanupErrors, not allowed to swallow the verdict assertion.
- Confirm that pali_chaos_kek.js is covered by src/mongo/db/modules/atlas/jstests/pali_chaos/BUILD.bazel. The all_javascript_files glob already picks up *.js, so no explicit entry should be needed; verify that the glob covers the new file.
- Export two new constants from chaos_config.js:
export const rotationIntervalMinMs = 20000;
export const rotationIntervalMaxMs = 60000;
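The rotation driver loop described in the work items above could be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation: only the four command names come from this ticket, while the command document shapes, the response fields, and every helper and parameter name (runOnPrimary, shouldStop, sleep, pollMs, unexpectedErrors) are hypothetical. runOnPrimary is assumed to re-discover the primary via hello after a failover, and shouldStop stands in for checking the stopLatch.

```javascript
// Hypothetical sketch: dependencies are injected so the loop can be exercised
// without a cluster. Command shapes and response fields are assumptions.
const TRANSIENT = new Set(["NotWritablePrimary", "ConflictingOperationInProgress", "NetworkError"]);

// Uniform integer in [minMs, maxMs].
function randomInterval(minMs, maxMs, random = Math.random) {
    return minMs + Math.floor(random() * (maxMs - minMs + 1));
}

// Polls a status command until it reports a terminal status or the run is stopping.
function pollUntilTerminal(runOnPrimary, statusCmd, shouldStop, sleep, pollMs) {
    while (!shouldStop()) {
        try {
            const res = runOnPrimary(statusCmd);
            // Assumed shape: either {status} or {rotationStatus: {status}}.
            const status = (res.rotationStatus || res).status;
            if (status === "completed" || status === "failure") return status;
        } catch (e) {
            // Status reads can fail during failover; retry on the next poll.
        }
        sleep(pollMs);
    }
    return "interrupted";
}

function chaosRotationDriver({runOnPrimary, shouldStop, sleep, minMs, maxMs, cmkList, pollMs = 500}) {
    const stats = {
        kek: {attempted: 0, completed: 0, failure: 0},
        cmk: {attempted: 0, completed: 0, failure: 0},
        unexpectedErrors: [],
    };
    const steps = [
        {key: "kek", cmd: {eseRotateActiveKEK: 1}, statusCmd: {getESERotateActiveKEKStatus: 1}},
        {key: "cmk", cmd: {updateESECMKIdentifierList: cmkList}, statusCmd: {getESECMKIdentifierListStatus: 1}},
    ];
    while (!shouldStop()) {
        sleep(randomInterval(minMs, maxMs));
        if (shouldStop()) break;
        // The CMK step runs after every KEK attempt, whether or not it succeeded.
        for (const {key, cmd, statusCmd} of steps) {
            stats[key].attempted++;
            try {
                runOnPrimary(cmd);
            } catch (e) {
                // A failed attempt never fails the run; non-transient errors are only recorded.
                if (!TRANSIENT.has(e.codeName)) stats.unexpectedErrors.push(String(e.codeName || e.message));
                continue;
            }
            const outcome = pollUntilTerminal(runOnPrimary, statusCmd, shouldStop, sleep, pollMs);
            if (outcome === "completed" || outcome === "failure") stats[key][outcome]++;
        }
    }
    return stats;
}
```

Injecting sleep and shouldStop keeps the loop testable with a fake clock; in the real test these would presumably map to the shell's sleep and the stopLatch count.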
Acceptance Criteria
The acceptance criteria for merging this ticket are limited to test correctness. If a full run exposes server-side bugs, each bug should be filed as a separate JIRA ticket and the test should be merged regardless. The test must not be held back waiting for those bugs to be fixed; it is a diagnostic tool, and other team members need access to it to reproduce and track the failures.
Required to merge
- pali_chaos_kek.js starts and runs without framework errors: the cluster comes up with dynamic KEK, PyKMIP initializes, the rotation driver thread launches, and the workload threads begin producing operations.
- The rotation driver fires at least one eseRotateActiveKEK attempt and one updateESECMKIdentifierList attempt during the chaos phase, confirming the driver loop is reachable (minimum-activity gate is not vacuous).
- The minimum-activity gate fires a [FRAMEWORK] failure if the rotation driver is intentionally disabled (e.g., by setting rotationIntervalMinMs above chaosDurationMs), confirming the gate itself works correctly.
- PyKMIP is cleanly torn down after the run even when the primary was killed at the end of the chaos phase (cleanup block does not throw past the verdict assertion).
- The test can complete a full run (warmup + chaos + cooldown) without the test framework itself crashing (workload threads, lag monitor, and rotation driver all join cleanly).
- The test can run manually in Evergreen.
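The minimum-activity gate and the cleanup behavior that these criteria exercise might look something like the sketch below. The failure message text comes from the work items; the function names, the shape of the stats object, and the verdict.summary.rotation field are assumptions made for illustration only.

```javascript
// Hypothetical sketch of the gate and cleanup handling; names and shapes are assumptions.

// Records rotation counts and flags a run in which no rotation of either kind completed.
function applyMinimumActivityGate(verdict, stats) {
    verdict.summary.rotation = stats; // assumed summary field name
    if (stats.kek.completed === 0 && stats.cmk.completed === 0) {
        verdict.failures.push("[FRAMEWORK] insufficient signal: rotation driver made no progress");
    }
    return verdict;
}

// Runs every cleanup step and collects errors, so a throw from one step
// (for example killPyKMIPServer) cannot skip later steps or hide the verdict.
function runCleanup(steps) {
    const cleanupErrors = [];
    for (const [name, fn] of steps) {
        try {
            fn();
        } catch (e) {
            cleanupErrors.push(`${name}: ${e.message}`);
        }
    }
    return cleanupErrors;
}
```

Because runCleanup returns the collected errors instead of throwing, the caller can report them and still reach the verdict assertion afterwards.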
Expected behavior when the server has no bugs (not required to merge)
- At least one KEK rotation and one CMK rotation reach completed status in verdict.summary.
- When kill_primary_mongod fires during an in-progress KEK rotation, the rotation driver tolerates the transient error and the post-chaos polling check confirms the rotation reaches a terminal status after the new primary steps up.
- When seal_and_kill_primary fires during an in-progress CMK rotation, the cluster recovers, the rotation driver resumes on the new primary, and the post-chaos CMK status poll succeeds.
- Standard pali_chaos checks (validate, dbHash, acknowledged writes, standby lag) all pass, confirming the rotation driver does not corrupt the data integrity guarantees that pali_chaos.js already verifies.
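The post-chaos polling check referenced above can be illustrated with a small timeout-bounded poll. This is a sketch only: the 30 s limit and the [MONGOD BUG] and [DATA BUG] tags come from the work items, while the callbacks (readKekStatus, readCmkStatus, writeAndReadBack), the injected sleep and now functions, and the message wording are hypothetical.

```javascript
// Hypothetical sketch of the post-chaos encryption validation; all names are assumptions.

// Polls until the status is no longer "pending" or the timeout elapses.
// Returns the last observed status ("pending" means the timeout was hit).
function waitUntilNotPending(readStatus, {timeoutMs = 30000, pollMs = 500, sleep, now}) {
    const deadline = now() + timeoutMs;
    for (;;) {
        const status = readStatus();
        if (status !== "pending") return status;
        if (now() >= deadline) return "pending";
        sleep(pollMs);
    }
}

// Returns a list of failure strings; an empty list means the validation passed.
function postChaosEncryptionCheck({readKekStatus, readCmkStatus, writeAndReadBack, sleep, now}) {
    const failures = [];
    if (waitUntilNotPending(readKekStatus, {sleep, now}) === "pending") {
        failures.push("[MONGOD BUG] KEK rotation still pending after 30 s");
    }
    if (waitUntilNotPending(readCmkStatus, {sleep, now}) === "pending") {
        failures.push("[MONGOD BUG] CMK rotation still pending after 30 s");
    }
    let roundTripOk = false;
    try {
        roundTripOk = writeAndReadBack(); // assumed to return true when the document reads back intact
    } catch (e) {
        roundTripOk = false;
    }
    if (!roundTripOk) failures.push("[DATA BUG] write/read-back failed after rotation");
    return failures;
}
```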
Notes
PyKMIP is already used in the disagg_storage/encryption JS tests (tag: uses_pykmip). The new test should carry the same tags: [uses_pykmip, incompatible_with_s390x, incompatible_with_windows_tls, resource_intensive].
This Jira does not include wiring the test to run automatically in Evergreen. To prevent unresolved Evergreen failures over a prolonged period, the test should be run manually until it is relatively stable and some bugs have been found and fixed. The follow-up ticket SERVER-129966 will do the automatic Evergreen wiring. To run the tests manually with this ticket:
evergreen patch -p mongodb-mongo-dsc-release-master \
    --variants atlas-amazon2023-arm64 \
    --tasks disagg_pali_chaos_kek
- is depended on by
  - SERVER-129965 Standby crashes with invariantWTOK when checkpoint pick-up installs a pre-rotation checkpoint after KEK rotation oplog entry is already applied (Closed)
  - SERVER-129966 Wire pali_chaos_kek.js to run automatically in Evergreen (Needs Scheduling)
- related to
  - SERVER-130020 Standby fatal abort when checkpoint pick-up encounters stale keystore LSN after KEK rotation (Needs Scheduling)
  - SERVER-130051 WiredTiger fatal abort on WT_NOTFOUND from __wt_btcur_remove on WiredTigerShared.wt_stable during checkpoint after kill_standby_page_materializer in disaggregated storage (Needs Scheduling)
  - SERVER-129815 Antithesis: test_composer coverage for dynamic KEK generation, KEK rotation, and CMK rotation (In Progress)