Fix stale-process misconnection in MongoRunner

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • DevProd Test Infrastructure
    • Fully Compatible
    • DevProd Test Infra 2026-06-16, DevProd Test Infra 2026-06-30
    • 200

        1. Root cause

      A stale mongod from a previous test was still listening on a port when the next test tried to start a new mongod on the same port. The new process exited immediately with `EXIT_NET_ERROR` (48, "address already in use"), but `MongoRunner.awaitConnection` TCP-connected to the stale process, saw a successful handshake, and silently returned it as the "new" connection. The caller then ran operations against a mongod whose dbpath had already been wiped, producing `NoSuchKey` / `ENOENT` errors from an empty data directory.

      The stale process was possible because `shutdownTimeoutMillisForSignaledShutdown` is 100ms; under high CPU load a mongod from a previous test may not finish shutting down before the next test starts on the same port.

      This is a test infrastructure bug in `src/mongo/shell/servers.js` — `awaitConnection` is responsible for verifying it connected to the process it started, and it did not.
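The root-cause sequence can be reproduced in miniature with plain sockets. The sketch below is illustrative only and does not use any MongoDB code: `demo_stale_listener` is a hypothetical helper in which one socket plays the stale mongod, a second bind to the same port fails, and a client connect still succeeds because the old listener answers. Note that the OS-level `EADDRINUSE` errno is a separate value from mongod's `EXIT_NET_ERROR` exit code.

```python
import errno
import socket


def demo_stale_listener():
    """Hypothetical demo: a stale listener makes a bare TCP connect look healthy."""
    # "Stale" server: still bound and listening from a previous test.
    stale = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    stale.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    stale.listen(1)
    port = stale.getsockname()[1]

    # "New" server: tries to bind the same port and fails.
    new = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    bind_errno = None
    try:
        new.bind(("127.0.0.1", port))
    except OSError as exc:
        bind_errno = exc.errno
    finally:
        new.close()

    # Client: the TCP connect succeeds anyway, because the stale listener
    # accepts it. Nothing at this layer says which process answered.
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    reached_port = client.getpeername()[1]
    client.close()
    stale.close()
    return bind_errno, reached_port == port
```

A successful connect therefore proves only that something is listening on the port, which is why an identity check on the answering process is needed.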

        2. Fix
          1. `src/mongo/shell/servers.js`

      `MongoRunner.awaitConnection` now reads `serverStatus.pid` after a successful TCP connection and compares it to the pid of the process it started. A mismatch means the stale process answered; the function sets `serverExitCodeMap[port] = EXIT_NET_ERROR` and throws `StopError(EXIT_NET_ERROR)` — the same error the caller sees when the new process fails to start.

      Note: `serverStatus.pid` and the pid returned by `_startMongoProgram` are both `NumberLong` objects, which don't compare correctly with `==`/`!=`. The fix uses unary `+` coercion (`+serverPid !== +pid`) for a correct numeric comparison.
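The pid check and the comparison pitfall can be sketched as follows. The actual fix is JavaScript in `servers.js`; this is a Python analogue, not the shipped code. `BoxedLong` is a hypothetical stand-in for `NumberLong` (a wrapper object with no value-based equality), `StopError` is a hypothetical stand-in for the shell's error type, and `verify_pid` and `exit_code_map` are illustrative names.

```python
EXIT_NET_ERROR = 48  # exit code named in this ticket


class StopError(Exception):
    """Hypothetical stand-in for the shell's StopError."""

    def __init__(self, exit_code):
        super().__init__(f"process stopped with exit code {exit_code}")
        self.exit_code = exit_code


class BoxedLong:
    """Hypothetical stand-in for NumberLong: compares by identity, not value."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value


def verify_pid(server_pid, started_pid, port, exit_code_map):
    """Raise StopError if the process that answered is not the one we started."""
    # Coerce both sides to plain numbers first, mirroring `+serverPid !== +pid`.
    # Comparing the wrappers directly would report a mismatch for equal pids.
    if int(server_pid) != int(started_pid):
        exit_code_map[port] = EXIT_NET_ERROR
        raise StopError(EXIT_NET_ERROR)
```

With this shape, `BoxedLong(1234) != BoxedLong(1234)` is true because the two wrappers are distinct objects, while `verify_pid(BoxedLong(1234), BoxedLong(1234), port, {})` returns normally because the coerced values match.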

          2. `jstests/noPassthrough/shell/mongorunner_stale_process_detection.js`

      New regression test: starts two mongods, then calls `awaitConnection` with one mongod's pid and the other's port, verifying that `StopError(EXIT_NET_ERROR)` is thrown instead of silently returning the wrong connection.

        3. Defensive hardening
          1. `buildscripts/s3_binary/download.py`

      Two bugs in the S3 download path produced misleading errors when a compile artifact was intermittently absent — obscuring the real stale-process failure in some runs.

      • *Bug 1*: `_fetch_remote_sha256_hash` called `download_from_s3_with_requests` without `raise_on_error=True`. A 403 response wrote the S3 XML error body to the temp file; `read_sha_file` then returned `<?xml` as the expected hash, causing a spurious redownload attempt. Fix: pass `raise_on_error=True`.
      • *Bug 2*: When `--ignore-file-not-exist` was set and all downloads failed, `_download_and_verify` returned early but its caller still called `os.replace(tempfile_name, local_path)`, atomically creating a zero-byte `mongo-binaries.zst`. Extraction then failed with a confusing crash. Fix: make `_download_and_verify` return a `bool` and gate `os.replace` on it.
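Both fixes follow the same pattern: surface the failure instead of letting an error artifact pass as real data. A minimal sketch of that pattern is below. It is not the code in `download.py`; `fetch_sha256`, `download_and_verify`, `install`, `DownloadError`, and the injected `fetch` / `download` callables are all hypothetical names chosen for illustration.

```python
import os
import tempfile


class DownloadError(Exception):
    """Hypothetical error type for a failed download."""


def fetch_sha256(fetch, url):
    """`fetch(url)` is assumed to return (status_code, body_text)."""
    status, body = fetch(url)
    if status != 200:
        # Bug 1 analogue: raise instead of parsing an error body as a hash.
        raise DownloadError(f"HTTP {status} fetching {url}")
    return body.split()[0]


def download_and_verify(download, dest_path):
    """Return True only if `download` wrote the artifact to dest_path."""
    try:
        download(dest_path)
    except DownloadError:
        return False
    return True


def install(download, local_path):
    """Download to a temp file and move it into place only on success."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(local_path) or ".")
    os.close(fd)
    try:
        # Bug 2 analogue: gate the atomic replace on the boolean result,
        # so a failed download never produces a zero-byte target file.
        if download_and_verify(download, tmp):
            os.replace(tmp, local_path)
            return True
        return False
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
```

When every download fails, `install` returns `False` and leaves no file at `local_path`, so a later extraction step sees a missing artifact rather than an empty one.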

            Assignee: Steve McClure
            Reporter: Steve McClure
