SIGSEGV (Exit Code 139) about 30s after start on AMD Zen 5 due to hardware Shadow Stacks (user_shstk) clashing with coroutines

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker - P1
    • Affects Version/s: 8.2.3, 8.2.5
    • Component/s: None
    • Environment:
    • ALL
      1. Provision an AMD Zen 4/Zen 5 machine with user_shstk enabled in the kernel (e.g., Ubuntu 24.04/26.04, kernel 6.8+). I'm using a Ryzen AI Max+ 395 (Strix Halo) with kernel 6.19.0-6-generic.
      2. Run mongod from either the mongo or the mongodb/mongodb-community-server image:
        docker run --name mongo -d mongo:8.2.5
      3. Wait ~30-35 seconds.
      4. `docker inspect` will show the container Exited (139).
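Steps 3 and 4 can be checked from a script rather than by eye. A minimal Python sketch, assuming the standard `docker inspect` JSON layout (a one-element list whose `State` object carries `Status`, `ExitCode`, `StartedAt` and `FinishedAt`); `parse_docker_time` and `summarize_state` are hypothetical helpers, not part of any Docker or MongoDB tooling. Besides the exit code, it reports how long the container actually ran, which is a quick way to compare the ~30-35 second timing across machines.

```python
import json
from datetime import datetime


def parse_docker_time(ts: str) -> datetime:
    """Parse a Docker RFC 3339 timestamp such as '2025-01-01T12:00:31.5000000Z'.

    datetime.fromisoformat() on older Pythons accepts neither a trailing 'Z'
    nor more than 6 fractional digits, so normalize both before parsing.
    """
    ts = ts.replace("Z", "+00:00")
    if "." in ts:
        head, rest = ts.split(".", 1)
        # The fractional seconds run until the sign of the UTC offset.
        cut = next((i for i, c in enumerate(rest) if c in "+-"), len(rest))
        frac, offset = rest[:cut], rest[cut:]
        ts = f"{head}.{frac[:6].ljust(6, '0')}{offset}"
    return datetime.fromisoformat(ts)


def summarize_state(inspect_output: str) -> dict:
    """Extract status, exit code and uptime from `docker inspect <container>` JSON."""
    state = json.loads(inspect_output)[0]["State"]
    summary = {"status": state["Status"], "exit_code": state["ExitCode"], "uptime_s": None}
    if state["Status"] == "exited":
        started = parse_docker_time(state["StartedAt"])
        finished = parse_docker_time(state["FinishedAt"])
        summary["uptime_s"] = (finished - started).total_seconds()
    return summary
```

Typical use: `docker inspect mongo > state.json`, then `summarize_state(open("state.json").read())`.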

      I've only validated this issue on Docker so far. If anyone else has a modern CPU and kernel, please re-validate with a native source build of mongod. I might test this later if I get some more spare time.

      Workaround:
      The crash can be prevented by passing an environment variable to disable Shadow Stacks in glibc before mongod boots:

      docker run -d \
        --env GLIBC_TUNABLES="glibc.cpu.hwcaps=-SHSTK" \
        mongo:8.2.5

      With -SHSTK applied, the container remains perfectly stable. (Note: setting glibc.pthread.rseq=0 does not fix the issue, which indicates a CET/Shadow Stack fault rather than a TCMalloc rseq fault.)
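The comparison in the note (rseq off vs. SHSTK off) is easy to rerun as a small A/B matrix. A sketch that only builds the `docker run` command lines, assuming Docker is on PATH when you paste them; the container names, the `VARIANTS` table and the helper names are illustrative, and the outcomes in the comments are only what this report observed.

```python
import shlex

# Illustrative A/B matrix. Expected outcomes are as reported, not independently verified.
VARIANTS = {
    "baseline": None,                         # reported: exits with 139 after ~30 s
    "rseq-off": "glibc.pthread.rseq=0",       # reported: still crashes
    "shstk-off": "glibc.cpu.hwcaps=-SHSTK",   # reported: stays up
}


def docker_run_argv(image, name, tunables=None):
    """Build a `docker run` argv, optionally passing GLIBC_TUNABLES into the container."""
    argv = ["docker", "run", "-d", "--name", name]
    if tunables:
        argv += ["--env", f"GLIBC_TUNABLES={tunables}"]
    argv.append(image)
    return argv


def commands_for(image):
    """Return one shell-quoted command line per variant, ready to paste."""
    return [shlex.join(docker_run_argv(image, f"mongo-{label}", tunables))
            for label, tunables in VARIANTS.items()]


if __name__ == "__main__":
    for line in commands_for("mongo:8.2.5"):
        print(line)
```

Multiple tunables can be combined in one GLIBC_TUNABLES value by separating them with colons, if you want to test rseq and SHSTK changes together.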

       

      Note for docker maintainers

      The mongodb/mongodb-community-server image masks this exit code because its flawed Python entrypoint wrapper returns 0, whereas the community mongo image correctly propagates the 139 exit code via its exec gosu implementation.
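For the maintainers, the fix in a Python wrapper amounts to preserving the child's status, including death by signal. The actual entrypoint of mongodb/mongodb-community-server is not reproduced here; this is a hypothetical minimal wrapper (`container_exit_status` and `run_server` are illustrative names) showing the translation Docker expects.

```python
import subprocess
import sys


def container_exit_status(returncode):
    """Translate a subprocess return code into the status Docker should report.

    subprocess uses -N for "killed by signal N"; shells and Docker use 128 + N
    (so a SIGSEGV, signal 11, becomes 139).
    """
    return 128 - returncode if returncode < 0 else returncode


def run_server(argv):
    """Run the real server as a child process and return its exit status unchanged."""
    completed = subprocess.run(argv)
    return container_exit_status(completed.returncode)


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python3 entrypoint.py mongod --bind_ip_all
    sys.exit(run_server(sys.argv[1:]))
```

If the wrapper has nothing left to do after startup, `os.execvp(argv[0], argv)` is simpler still: it replaces the Python process with mongod, just as `exec gosu` does, so Docker sees mongod's own status and signals such as SIGTERM from `docker stop` reach mongod directly.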

       

      Short term fix

      Update the official Dockerfiles to permanently inject ENV GLIBC_TUNABLES="glibc.cpu.hwcaps=-SHSTK" into the environment to prevent the CET SIGSEGV on modern hardware.

      Long term fix

      Refactor MongoDB's C++ coroutine context-switching logic to be CET-compliant (utilizing incsspq / rstorssp instructions) so it does not trigger Shadow Stack hardware faults. Yep... easy huh
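Why a plain stack pivot trips the hardware can be shown with a toy model. This is a conceptual Python simulation, not CET code: it assumes each execution context needs its own shadow stack, and that a CET-aware switch moves to the target's shadow stack at the same moment as the regular stack (on real hardware via RSTORSSP/SAVEPREVSSP, with the new shadow stack allocated through Linux's `map_shadow_stack` syscall). Class and method names are hypothetical.

```python
class ControlProtectionFault(Exception):
    """Stands in for the #CP fault the CPU raises (Linux delivers it as SIGSEGV)."""


class ToyCpu:
    """Toy model of call/ret with a shadow stack. Stacks hold return addresses only."""

    def __init__(self):
        self.stack = []   # ordinary stack: software can repoint it at will
        self.shadow = []  # shadow stack: only call/ret (and CET instructions) touch it

    def call(self, return_addr):
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self):
        addr, expected = self.stack.pop(), self.shadow.pop()
        if addr != expected:
            raise ControlProtectionFault(f"ret to {addr!r}, shadow stack has {expected!r}")
        return addr

    def switch_stack_only(self, new_stack):
        """CET-unaware context switch: repoint the stack, leave the shadow stack alone."""
        old, self.stack = self.stack, new_stack
        return old

    def switch_both(self, new_stack, new_shadow):
        """CET-aware switch: also move to the target's shadow stack (RSTORSSP's job)."""
        old = (self.stack, self.shadow)
        self.stack, self.shadow = new_stack, new_shadow
        return old


if __name__ == "__main__":
    cpu = ToyCpu()
    cpu.call("scheduler")                  # the scheduler calls into a switch routine
    cpu.switch_stack_only(["coro_entry"])  # pivot to a coroutine stack seeded with its entry
    try:
        cpu.ret()
    except ControlProtectionFault as exc:
        print("fault:", exc)
```

The demo takes the fault path; calling `switch_both` with a shadow stack seeded the same way as the coroutine stack returns cleanly instead, which is the property a CET-compliant context switch has to preserve.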


      When running the mongodb/mongodb-community-server and mongo 8.x Docker images on AMD Strix Halo with a 6.19.0-6-generic kernel, the mongod process boots successfully but silently crashes with exit code 139 (SIGSEGV) about 30 seconds later.
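The exit code itself already identifies the signal: Docker, like POSIX shells, reports a process killed by signal N as 128 + N, so 139 - 128 = 11, which is SIGSEGV on Linux. This also rules out the OOM killer, which would show up as 137 (SIGKILL). A small Python helper for decoding such codes (the function name is hypothetical):

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code; values above 128 mean 'killed by signal code - 128'."""
    if code <= 128:
        return f"exited with status {code}"
    try:
        sig = signal.Signals(code - 128)
    except ValueError:
        return f"exited with status {code} (no matching signal)"
    return f"terminated by {sig.name} (signal {sig.value})"
```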

      No stack trace is printed to the MongoDB logs, and the host's dmesg / journalctl output shows nothing related to the crash. The process simply vanishes.

      The crash is caused by a hardware-enforced Control-flow Enforcement Technology (CET) trap, specifically AMD's Shadow Stacks (user_shstk).
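Whether a host is exposed at all starts with the CPU flag: kernels with user-space shadow stack support list `user_shstk` among the `flags` in `/proc/cpuinfo`. Inside a container, `/proc/cpuinfo` normally reflects the host kernel, so the same check works there. A minimal sketch (helper names hypothetical); note that the flag shows only that hardware and kernel support the feature, not that a given process has it enabled.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def host_supports_user_shstk(path="/proc/cpuinfo"):
    """True if the running kernel advertises user-space shadow stack support (x86)."""
    return "user_shstk" in cpu_flags(Path(path).read_text())
```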

      1. The host CPU supports user_shstk.
      2. The container's userland (Ubuntu 24.04) ships a glibc that enables Shadow Stacks at startup when the hardware and kernel support them and the loaded binaries are marked as shadow-stack compatible.
      3. At the ~30-second mark, MongoDB spins up background threads (likely LogicalSessionCacheReap / LogicalSessionCacheRefresh).
      4. The C++ coroutines used by MongoDB perform a context switch (stack pivoting) that moves the stack pointer without switching the shadow stack.
      5. On the next return, the address popped from the regular stack no longer matches the top of the shadow stack. The CPU treats this as a possible Return-Oriented Programming (ROP) attack and raises a control-protection (#CP) fault, which the kernel delivers to the process as SIGSEGV.
      6. Because the shadow stack is still inconsistent, MongoDB's internal C++ crash handler presumably faults a second time and fails to print a stack trace, resulting in a completely silent death.
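If this sequence is right, one would expect the running mongod to report shadow stacks as enabled before it dies, and not once the -SHSTK tunable is applied. Kernels with user shadow stack support expose per-thread CET state in `/proc/<pid>/status` as `x86_Thread_features` and `x86_Thread_features_locked`. A sketch for checking this during the ~30-second window; the helper names are hypothetical, and it assumes it runs on the host with permission to read the process's status file, using the host PID from `docker inspect -f '{{.State.Pid}}' mongo`.

```python
from pathlib import Path


def thread_features(status_text):
    """Parse the x86 CET lines of a /proc/<pid>/status (or task/<tid>/status) file."""
    result = {"enabled": set(), "locked": set()}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        if key == "x86_Thread_features":
            result["enabled"] = set(value.split())
        elif key == "x86_Thread_features_locked":
            result["locked"] = set(value.split())
    return result


def shstk_enabled(pid):
    """True if the kernel reports shadow stacks enabled for the process's main thread."""
    status = Path(f"/proc/{pid}/status").read_text()
    return "shstk" in thread_features(status)["enabled"]
```

Individual threads (for example the background threads suspected above) have their own status files under `/proc/<pid>/task/<tid>/status`, which the same parser accepts.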

            Assignee:
            Unassigned
            Reporter:
            Joe Bennett
            Votes:
            0
            Watchers:
            5
