Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Optimization
ALL
PrimaryDrivenResumableIndexBuildTest._readResumeMetrics reads each per-node OpenTelemetry metrics JSONL file with the shell cat() builtin, which has a hard 16 MiB cap. On slow variants (e.g. enterprise-rhel8-debug-tsan) the test runs long enough that the per-node metrics file grows past that cap, and the test fails during teardown verification with:
Error: cat() : file /data/db/job5/mongorunner/resumable_drain_phase_unclean_restart_otel_metrics_326978/mongodb-879239-20260607-metrics.jsonl too big to load as a variable (file is 18021853 bytes, limit is 16777216 bytes.)
Cause. The OTel periodic exporter writes a full cumulative snapshot of every counter to the JSONL file on every flush, and the default openTelemetryExportIntervalMillis is 1000 ms. A typical snapshot in this test is ~35 KiB, so ~510 s of test wall-clock time (observed on TSAN) × ~1 snapshot/s × ~35 KiB ≈ 17.4 MiB per node, which is consistent with the 18021853-byte file in the error above. _readResumeMetrics only ever uses the latest JSON line in each file (the most recent cumulative snapshot from that node), so the rest of the file is wasted I/O and bloat.
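The growth estimate above can be checked with a few lines of plain JavaScript. This is only a back-of-the-envelope sketch: estimateMetricsFileBytes is a hypothetical helper, not part of the test library, and it assumes a constant snapshot size of 35 KiB and exactly one snapshot per export interval.

```javascript
// Hypothetical helper (not in the test library): estimate the per-node JSONL
// size, assuming a fixed snapshot size and one snapshot per export interval.
function estimateMetricsFileBytes(runSeconds, exportIntervalMillis, snapshotKiB) {
    const flushes = Math.floor((runSeconds * 1000) / exportIntervalMillis);
    return flushes * snapshotKiB * 1024;
}

// The limit quoted in the cat() error message: 16 MiB.
const CAT_LIMIT_BYTES = 16 * 1024 * 1024;

// ~510 s TSAN run at the default 1000 ms export interval, ~35 KiB per snapshot.
const estimatedBytes = estimateMetricsFileBytes(510, 1000, 35);

console.log(`estimated: ${estimatedBytes} bytes, limit: ${CAT_LIMIT_BYTES} bytes`);
console.log(`exceeds cat() cap: ${estimatedBytes > CAT_LIMIT_BYTES}`);
```

Under these assumptions the estimate is 18278400 bytes, slightly above the observed 18021853 bytes and above the 16777216-byte cap either way.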
Repro. Run the suite with the TSAN command from BF-43763:
LANG=C TSAN_OPTIONS="abort_on_error=1:..." buildscripts/resmoke.py run \
    --suites=no_passthrough_primary_driven_index_builds \
    jstests/noPassthrough/index_builds/primary_driven/resumable_drain_phase_unclean_restart.js
Fix. In PrimaryDrivenResumableIndexBuildTest.setUp() in jstests/noPassthrough/libs/index_builds/primary_driven.js, set openTelemetryExportIntervalMillis: 5000 on every replica-set node:
const rst = new ReplSetTest({
    nodes,
    nodeOptions: {setParameter: {...otelParams, openTelemetryExportIntervalMillis: 5000}},
});
A 5 s flush cadence keeps each per-node JSONL file well under the 16 MiB cap (~3.5 MiB on a 510 s TSAN run), and _verifyResumeMetric still has plenty of slack: its assert.soon polls for 30 s at 200 ms intervals, so a single resume gets ~6 flush windows in which the new cumulative snapshot can land on disk. _readResumeMetrics reads only the last fully-formed line per file, so a longer interval doesn't change what is verified; it only slows file growth.
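To illustrate why only the last fully-formed line matters, here is a minimal sketch of that selection logic in plain JavaScript. It is not the actual _readResumeMetrics implementation: lastCompleteJsonLine is a hypothetical name, and the sketch assumes each JSONL line is a single JSON object and that the final line may be truncated if the exporter is mid-write.

```javascript
// Hypothetical sketch (not the real _readResumeMetrics): return the parsed
// contents of the last line that is complete, valid JSON, or null if none is.
function lastCompleteJsonLine(text) {
    const lines = text.split("\n");
    // Walk backwards so a truncated trailing line is skipped, not fatal.
    for (let i = lines.length - 1; i >= 0; i--) {
        const line = lines[i].trim();
        if (line === "") {
            continue;  // Trailing newline or blank line.
        }
        try {
            return JSON.parse(line);
        } catch (e) {
            // Partially written line; fall back to the previous one.
        }
    }
    return null;
}

// Two complete cumulative snapshots followed by a half-written third one.
// The field name "resumed" is made up for this example.
const sample = '{"resumed":1}\n{"resumed":2}\n{"resumed":';
const latest = lastCompleteJsonLine(sample);
console.log(JSON.stringify(latest));
```

Because every snapshot is cumulative, dropping earlier lines loses nothing, which is why a longer export interval leaves the verified values unchanged.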