Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Integration
ALL
Problem
Tests that exercise $function and $where expressions fail in ASAN/TSAN sanitizer builds whenever the server uses the mozjs-wasm JavaScript engine. Two distinct failure modes have been identified.
Failure Mode 1 — ASAN: mongod restart takes >5 minutes under memory pressure
In rhel8-debug-aubsan builds, restarting a terminated mongod node inside a 5-node replica set takes over 4 minutes just to produce its first log line, and the node is still not accepting connections when the resmoke fixture await_ready() timeout of 300 seconds expires. The ContinuousStepdown hook then reports:
{noformat}
buildscripts.resmokelib.errors.ServerFailure: Failed to connect to mongod on port 21050 after 300 seconds
{noformat}
The fixture teardown hangs waiting for the unresponsive process and is eventually killed by the Evergreen task timeout (~7,000 seconds later).
_Key evidence:_
* Old node (PID 3360) shut down cleanly at {{2026-05-29T17:51:35.233Z}}
* New node (PID 12683) produced no log output until {{2026-05-29T17:55:44.931Z}}, a _4 min 9 sec_ gap before WiredTiger even opened
* During this window, 488 "Waiting to connect to mongod on port 21050" attempts were logged
* The same binary starts in ~22 seconds when it is the only mongod on the machine; the slowdown only occurs when 4 other ASAN-instrumented mongods are already running and consuming available RAM
* The delay precedes {{ScriptEngine::setup()}} entirely: it occurs in the OS loader / ASAN shadow-memory initialization phase for {{libwasmtime_engine.so}}
* ASAN is configured with {{check_initialization_order=true:strict_init_order=true}}, which adds per-static-global instrumentation that interacts badly with Wasmtime's many Rust statics (rayon thread pool, JIT state, etc.)
_Observed in:_ burn_in runs of {{return_bson_scalar_from_js_function.js}} in {{replica_sets_reconfig_terminate_primary_jscore_passthrough_priority_ports}}.
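The gap quoted in the key evidence can be re-derived from the two log timestamps. The snippet below is a minimal sketch that only does the arithmetic; the helper name {{parse_log_ts}} and the constant {{AWAIT_READY_TIMEOUT_SECS}} are illustrative names, not resmoke identifiers, and the budget fraction assumes the 300-second timer started at roughly the moment the old node shut down.

```python
from datetime import datetime, timezone

def parse_log_ts(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp with millisecond precision."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

# Timestamps copied from the key evidence above.
old_node_shutdown = parse_log_ts("2026-05-29T17:51:35.233Z")
new_node_first_log = parse_log_ts("2026-05-29T17:55:44.931Z")

# Silent period between clean shutdown and the restarted node's first log line.
gap_seconds = (new_node_first_log - old_node_shutdown).total_seconds()
minutes, seconds = divmod(gap_seconds, 60)

# Assumption: the await_ready() timer began near the shutdown timestamp.
AWAIT_READY_TIMEOUT_SECS = 300
budget_fraction = gap_seconds / AWAIT_READY_TIMEOUT_SECS

print(f"silent gap: {gap_seconds:.3f} s ({int(minutes)} min {seconds:.3f} s)")
print(f"share of await_ready budget consumed before first log line: {budget_fraction:.0%}")
```

Under that assumption, most of the timeout is spent before mongod logs anything, which leaves little time for WiredTiger startup and replica set initialization.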
h3. Failure Mode 2 — TSAN: internal CHECK failure aborts mongod during $function execution
In {{enterprise-rhel8-arm64-debug-tsan}} builds, executing a {{$function}} aggregation expression triggers a TSAN-internal assertion failure:
{noformat}
ThreadSanitizer: CHECK failed: tsan_interceptors_posix.cpp:2079 "((thr->slot)) != (0)" (0x0, 0x0) (tid=29702)
{noformat}
With TSAN_OPTIONS=abort_on_error=1:halt_on_error=1, this immediately aborts mongod. The mongo shell receives HostUnreachable: Connection closed by peer and the test assertion fails. The fixture teardown subsequently hangs for hours waiting for the crashed mongod.
This failure is related to SERVER-115422, which added race:wasmtime to etc/tsan.suppressions to suppress data-race false positives from Wasmtime's rayon thread pool. That suppression covers ThreadSanitizer: DATA RACE reports but not ThreadSanitizer: CHECK failed crashes, which happen at a lower level before race detection runs.
Observed in: burn_in runs of expression_function.js in aggregation_unsplittable_collections_on_random_shard_passthrough.
Root Cause
Both failures stem from the mozjs-wasm JS engine's use of the Wasmtime Rust runtime (libwasmtime_engine.so):
- Wasmtime starts a rayon work-stealing thread pool during initialization. Rayon threads are created via Rust's std::thread::spawn, which does not always go through TSAN's pthread_create interception, leaving those threads without TSAN metadata (hence the thr->slot == 0 CHECK failure when TSAN later tries to inspect them).
- Wasmtime's JIT-compiled code pages and WASM linear memory require significant ASAN shadow memory. Under memory pressure from 4+ concurrent ASAN-instrumented mongod processes, allocating and initializing this shadow memory is extremely slow (>4 minutes).
- Additionally, check_initialization_order=true instruments every access to Wasmtime's many Rust static globals, adding substantial per-access overhead at startup.
Impact
Both return_bson_scalar_from_js_function.js and expression_function.js were modified on calvin.nguyen/SERVER-116052 as part of adding $function support to the WASM engine. The burn_in system selected both for sanitizer validation, where they now fail consistently. Fixture teardown hangs consume 6,000–8,500 seconds of Evergreen task time per run.
Fix
Short-term (unblocks CI immediately)
Re-add the mozjs_wasm_unsupported tag to both affected tests. This tag causes resmoke to skip the test when the server reports javascriptEngine: "mozjs-wasm", which is always the case in ASAN/TSAN builds that enable the WASM engine.
Done in: calvin.nguyen/SERVER-116052 — commit pending.
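The skip rule described above can be modeled in a few lines. This is a hypothetical sketch of the decision only, not resmoke's actual implementation: the function {{should_skip}} and both constants are made-up names, and "mozjs" is an assumed name for the non-WASM engine.

```python
from typing import AbstractSet

# Illustrative constants taken from the ticket text.
WASM_UNSUPPORTED_TAG = "mozjs_wasm_unsupported"
WASM_ENGINE_NAME = "mozjs-wasm"

def should_skip(test_tags: AbstractSet[str], javascript_engine: str) -> bool:
    """Return True when a tagged test should not run against the WASM engine.

    Hypothetical model: the real check lives in resmoke and may differ.
    """
    return javascript_engine == WASM_ENGINE_NAME and WASM_UNSUPPORTED_TAG in test_tags

tagged = {"requires_scripting", WASM_UNSUPPORTED_TAG}
untagged = {"requires_scripting"}

print(should_skip(tagged, "mozjs-wasm"))    # tagged test on the WASM engine
print(should_skip(untagged, "mozjs-wasm"))  # untagged test on the WASM engine
print(should_skip(tagged, "mozjs"))         # tagged test on an assumed non-WASM engine
```

The point of the model is that the tag only takes effect on servers reporting the WASM engine, so coverage on other builds is unchanged.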
Long-term
- TSAN: Investigate whether a fun:__sanitizer_on_thread_start or race:rayon suppression in etc/tsan.suppressions covers the CHECK failure. Alternatively, configure Wasmtime to use a thread-spawn callback so TSAN can properly register rayon threads.
- ASAN: Add an entry to etc/asan.denylist for Wasmtime, analogous to the existing grpc suppression for static-init-order issues (src:src/third_party/grpc/dist/src/cpp/util/status.cc). This will reduce shadow-memory and instrumentation overhead during startup.
- Both: Investigate sharing a single WasmEngineContext (Wasmtime Engine + deserialized .cwasm Component) per process rather than recreating it on every scope reset(). The current design calls Component::deserialize() on every JS operation reset, adding unnecessary overhead. The epoch-deadline mechanism in MozJSWasmBridge can be fixed to use current_epoch + 1 per store rather than always 1, making engine reuse safe across kill() calls.
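To illustrate why a fixed deadline of 1 would block engine reuse, here is a toy model of the epoch-deadline idea. It is a sketch under stated assumptions, not the MozJSWasmBridge code and not Wasmtime's API: it assumes the deadline is compared as an absolute epoch value on a shared, monotonically increasing counter, and every class and function name in it is invented for illustration.

```python
class ToyEngine:
    """Stand-in for a process-wide engine with a shared epoch counter."""

    def __init__(self) -> None:
        self.epoch = 0

    def kill(self) -> None:
        # Interrupt running stores by advancing the shared epoch.
        self.epoch += 1


class ToyStore:
    """Stand-in for a per-scope store with an absolute epoch deadline."""

    def __init__(self, engine: ToyEngine, deadline: int) -> None:
        self.engine = engine
        self.deadline = deadline

    def is_interrupted(self) -> bool:
        # Execution stops once the engine epoch reaches the deadline.
        return self.engine.epoch >= self.deadline


def new_store_fixed(engine: ToyEngine) -> ToyStore:
    # Models "always 1": only safe if the engine is recreated each time.
    return ToyStore(engine, deadline=1)


def new_store_relative(engine: ToyEngine) -> ToyStore:
    # Models "current_epoch + 1": the deadline tracks the shared counter.
    return ToyStore(engine, deadline=engine.epoch + 1)


engine = ToyEngine()
first = new_store_relative(engine)
engine.kill()                        # first kill() advances the epoch to 1

stale = new_store_fixed(engine)      # deadline 1, already reached
fresh = new_store_relative(engine)   # deadline 2, one tick in the future

print(first.is_interrupted(), stale.is_interrupted(), fresh.is_interrupted())
```

In this model, a store created with the fixed deadline after any earlier kill() starts out already interrupted, while the relative deadline keeps new stores runnable until the next kill().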
References
- SERVER-115422: race:wasmtime TSAN suppression (rayon data-race false positives)
- etc/tsan.suppressions line 74: race:wasmtime
- etc/asan.denylist: existing static-init-order suppression for grpc
- is related to SERVER-115422 Implement the MozJS wrapper and exported interface (Closed)
- related to SERVER-116052 Add support for $function (Closed)