-
Type:
Task
-
Resolution: Fixed
-
Priority:
Major - P3
-
Affects Version/s: None
-
Component/s: None
-
DevProd Test Infrastructure
-
Fully Compatible
-
DevProd Test Infra 2026-06-16, DevProd Test Infra 2026-06-30
- Root cause
A stale mongod from a previous test was still listening on a port when the next test tried to start a new mongod on the same port. The new process exited immediately with `EXIT_NET_ERROR` (48, "address already in use"), but `MongoRunner.awaitConnection` TCP-connected to the stale process, saw a successful handshake, and silently returned it as the "new" connection. The caller then ran operations against a mongod whose dbpath had already been wiped, producing `NoSuchKey` / `ENOENT` errors from an empty data directory.
The stale process was possible because `shutdownTimeoutMillisForSignaledShutdown` is 100ms; under high CPU load a mongod from a previous test may not finish shutting down before the next test starts on the same port.
This is a test infrastructure bug in `src/mongo/shell/servers.js` — `awaitConnection` is responsible for verifying it connected to the process it started, and it did not.
-
- Fix
-
-
- `src/mongo/shell/servers.js`
-
`MongoRunner.awaitConnection` now reads `serverStatus.pid` after a successful TCP connection and compares it to the pid of the process it started. A mismatch means the stale process answered; the function sets `serverExitCodeMap[port] = EXIT_NET_ERROR` and throws `StopError(EXIT_NET_ERROR)` — the same error the caller sees when the new process fails to start.
Note: `serverStatus.pid` and the pid returned by `_startMongoProgram` are both `NumberLong` objects, which don't compare correctly with `==`/`!=`. The fix uses unary `+` coercion (`+serverPid !== +pid`) for a correct numeric comparison.
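The comparison pitfall described in the note can be illustrated with a minimal sketch, runnable under Node.js. Everything named here is an assumption made for illustration: `FakeNumberLong` is a hypothetical stand-in for the shell's `NumberLong` (an object wrapper exposing `valueOf()`), `StopError` is a simplified local class rather than the shell's own, and `assertConnectedToStartedProcess` is a hypothetical helper, not the actual code in `servers.js`.

```javascript
// Hypothetical stand-in for the shell's NumberLong: an object wrapper whose
// valueOf() yields the underlying number. This is not the real shell type.
class FakeNumberLong {
    constructor(value) {
        this.value = value;
    }
    valueOf() {
        return this.value;
    }
}

// Exit code named in the ticket (48, "address already in use").
const EXIT_NET_ERROR = 48;

// Simplified local stand-in for the shell's StopError.
class StopError extends Error {
    constructor(returnCode) {
        super("process exited with code " + returnCode);
        this.returnCode = returnCode;
    }
}

// Hypothetical helper: throws if the pid reported by the server that answered
// on the port differs from the pid of the process we started.
function assertConnectedToStartedProcess(serverPid, startedPid, port, serverExitCodeMap) {
    // Unary + coerces each wrapper to a primitive number via valueOf(),
    // so the comparison is numeric rather than by object identity.
    if (+serverPid !== +startedPid) {
        serverExitCodeMap[port] = EXIT_NET_ERROR;
        throw new StopError(EXIT_NET_ERROR);
    }
}

const a = new FakeNumberLong(4242);
const b = new FakeNumberLong(4242);
console.log(a != b);    // true: two distinct objects are compared by reference
console.log(+a !== +b); // false: the coerced numeric values are equal
```

Without the coercion, two wrapper objects holding the same pid would always look different, so every connection would be treated as a mismatch.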
-
-
- `jstests/noPassthrough/shell/mongorunner_stale_process_detection.js`
-
New regression test: starts two mongods, then calls `awaitConnection` with one mongod's pid and the other's port, verifying that `StopError(EXIT_NET_ERROR)` is thrown instead of silently returning the wrong connection.
-
- Defensive hardening
-
-
- `buildscripts/s3_binary/download.py`
-
Two bugs in the S3 download path produced misleading errors when a compile artifact was intermittently absent — obscuring the real stale-process failure in some runs.
- *Bug 1*: `_fetch_remote_sha256_hash` called `download_from_s3_with_requests` without `raise_on_error=True`. A 403 response wrote the S3 XML error body to the temp file; `read_sha_file` returned `<?xml` as the expected hash, causing a spurious redownload attempt. Fix: pass `raise_on_error=True`.
- *Bug 2*: When `--ignore-file-not-exist` was set and all downloads failed, `_download_and_verify` returned early but its caller still called `os.replace(tempfile_name, local_path)`, atomically creating a zero-byte `mongo-binaries.zst`. Extraction then failed with a confusing crash. Fix: make `_download_and_verify` return a `bool` and gate `os.replace` on it.
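The Bug 2 fix follows a general pattern: the download step reports success as a boolean, and the atomic rename runs only when that boolean is true. The sketch below shows the pattern in JavaScript (Node.js) rather than the script's Python, so it is an analogue and not the code in `download.py`. The names `downloadAndVerify`, `downloadTo`, and the `fetchers` array are hypothetical, and a `null` result models a download that failed or did not verify.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical downloader: tries each source in turn and returns true only
// if usable data was written to tempPath.
function downloadAndVerify(fetchers, tempPath) {
    for (const fetch of fetchers) {
        const data = fetch(); // null models a failed or unverified download
        if (data !== null) {
            fs.writeFileSync(tempPath, data);
            return true;
        }
    }
    return false;
}

// Hypothetical caller: the rename that publishes the file is gated on the
// boolean result, so a failed download never creates an empty target file.
function downloadTo(fetchers, localPath) {
    const tempPath = localPath + ".tmp";
    fs.writeFileSync(tempPath, ""); // models the pre-created, empty temp file
    if (downloadAndVerify(fetchers, tempPath)) {
        fs.renameSync(tempPath, localPath);
        return true;
    }
    fs.unlinkSync(tempPath); // clean up instead of publishing a zero-byte file
    return false;
}
```

With the gate in place, a run where every source fails leaves no file at the target path, so a later extraction step sees a clear "file missing" condition instead of a zero-byte archive.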
- is related to
-
SERVER-126027 Allow setting retryability for IngressRequestRateLimitExceeded errors at runtime
-
- Closed
-
-
SERVER-128212 Reduce flakiness in auth:false IngressRequestRateLimiter jstests
-
- Needs Verification
-