Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
DevProd Infrastructure
Summary
setUnixDomainSocketPermissions() in socket_utils.cpp uses chmod(path, permissions) after bind() creates the Unix domain socket. On Amazon Linux 2023 arm64, the socket file created by bind() intermittently vanishes before chmod() can run, causing Fatal Assertion 40487 (ENOENT).
Root Cause
In src/mongo/transport/asio/asio_transport_layer.cpp, the Unix socket creation sequence is:
- unlink() any existing socket (~line 628)
- bind() to create the new socket (~line 690)
- chmod() via setUnixDomainSocketPermissions() (~line 697)
The socket file created by bind() disappears before chmod() executes — within the same single-threaded initandlisten sequence. This causes chmod() to fail with ENOENT and triggers Fatal Assertion 40487.
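The sequence above can be sketched as a standalone function. This is a minimal illustration, not the code in asio_transport_layer.cpp: the name bindThenChmodByPath and its error-return convention are hypothetical. It shows that step 3 resolves the path a second time, which is the only point where ENOENT can surface once bind() has succeeded.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper (not the server's implementation): performs the
// unlink() / bind() / chmod() sequence on a Unix domain socket path.
// Returns 0 on success, or the errno value of the step that failed.
int bindThenChmodByPath(const std::string& path, mode_t permissions, int* outFd) {
    assert(outFd != nullptr);

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    if (path.size() >= sizeof(addr.sun_path)) {
        return ENAMETOOLONG;
    }
    std::memcpy(addr.sun_path, path.c_str(), path.size() + 1);

    int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return errno;
    }

    // Step 1: remove any stale socket file. A failure here (for example
    // ENOENT because nothing exists yet) is deliberately ignored.
    ::unlink(path.c_str());

    // Step 2: bind() creates the socket file at `path`.
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        int err = errno;
        ::close(fd);
        return err;
    }

    // Step 3: chmod() looks the path up again. If the directory entry is
    // not visible at this moment, this call fails with ENOENT.
    if (::chmod(path.c_str(), permissions) != 0) {
        int err = errno;
        ::close(fd);
        return err;
    }

    *outFd = fd;
    return 0;
}
```

On a healthy filesystem the function returns 0 and leaves the socket file with the requested mode; in the failure described here it would return ENOENT from step 3.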
What's been ruled out:
- Cross-process race conditions on the same socket path (port ranges are separated by ~250 ports per parallel Resmoke job)
- Resmoke port reuse from previous tests
- systemd-tmpfiles-clean (tested 8000 iterations, 0 failures)
- tmp.mount tmpfs overlay (ConditionPathIsSymbolicLink prevents activation)
- SELinux (permissive mode on AL2023)
- Cross-job port collision
This points to a platform-level issue specific to AL2023 arm64: a transient gap in directory entry visibility after bind() creates a Unix domain socket. The gap is likely in the XFS or VFS layer on kernel 6.1.x aarch64, and is possibly related to the /tmp -> /data/tmp symlink onto XFS on NVMe ephemeral storage.
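One way to probe for such a visibility gap outside the server is a tight loop that binds a socket and immediately looks the path up again. The sketch below is an assumption about how such a probe could look, not a tool referenced by this ticket; the function name countMissingAfterBind and the uds_probe_ file naming are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical probe: binds `iterations` Unix domain sockets under `dir`
// and counts how often the path is missing (ENOENT) right after bind().
// Returns the count, or -1 if a socket could not be set up at all.
int countMissingAfterBind(const std::string& dir, int iterations) {
    assert(iterations >= 0);
    int missing = 0;
    for (int i = 0; i < iterations; ++i) {
        const std::string path = dir + "/uds_probe_" +
            std::to_string(::getpid()) + "_" + std::to_string(i) + ".sock";

        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        if (path.size() >= sizeof(addr.sun_path)) {
            return -1;
        }
        std::memcpy(addr.sun_path, path.c_str(), path.size() + 1);

        int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
            return -1;
        }
        ::unlink(path.c_str());
        if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
            ::close(fd);
            return -1;
        }

        // Same kind of path lookup that chmod() performs after bind().
        struct stat st{};
        if (::lstat(path.c_str(), &st) != 0 && errno == ENOENT) {
            ++missing;
        }

        ::close(fd);
        ::unlink(path.c_str());
    }
    return missing;
}
```

Pointing the probe at the symlinked /tmp directory on an affected host, with a large iteration count, would indicate whether the gap reproduces independently of the server.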
Recommended Fix
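Use fchmod(fd, permissions) instead of chmod(path, permissions) in setUnixDomainSocketPermissions(). The socket file descriptor is already available, since it is the descriptor passed to bind(). fchmod() operates on the descriptor's inode rather than resolving the path, so it does not depend on the directory entry being visible. Two caveats need verifying as part of the fix: POSIX leaves the behavior of fchmod() on a socket descriptor unspecified, and on Linux the mode set through the descriptor is applied to the socket file only when bind() creates it, so the fchmod() call would likely have to move before bind().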
The same fix should be applied in the gRPC transport layer (grpc_transport_layer_impl.cpp ~line 262).
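A minimal sketch of the descriptor-based approach follows. It rests on an assumption about Linux behavior: the mode stored on the socket descriptor is what bind() uses (masked by the process umask) when it creates the socket file, so fchmod() is called before bind() here. POSIX leaves fchmod() on sockets unspecified, so this is not portable as written. The name bindWithFdPermissions is hypothetical and is not the signature of setUnixDomainSocketPermissions().

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper (Linux-specific assumption): sets the permissions
// through the socket descriptor so that no second path lookup is needed.
// Returns 0 on success, or the errno value of the step that failed.
int bindWithFdPermissions(const std::string& path, mode_t permissions, int* outFd) {
    assert(outFd != nullptr);

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    if (path.size() >= sizeof(addr.sun_path)) {
        return ENAMETOOLONG;
    }
    std::memcpy(addr.sun_path, path.c_str(), path.size() + 1);

    int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return errno;
    }

    // Remove any stale socket file; a failure here is ignored.
    ::unlink(path.c_str());

    // Assumed Linux behavior: the mode set here is applied to the socket
    // file when bind() creates it, reduced by the process umask.
    if (::fchmod(fd, permissions) != 0) {
        int err = errno;
        ::close(fd);
        return err;
    }

    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        int err = errno;
        ::close(fd);
        return err;
    }

    *outFd = fd;
    return 0;
}
```

Because the umask still applies in this variant, a caller that needs group or other permission bits would have to account for it, whereas chmod(path, ...) sets the mode exactly.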
Affected Files
- src/mongo/util/net/socket_utils.cpp — setUnixDomainSocketPermissions() (~line 280)
- src/mongo/transport/asio/asio_transport_layer.cpp — callers (~line 703-706)
- src/mongo/transport/grpc/grpc_transport_layer_impl.cpp — caller (~line 262)
Environment Details
- Distro: amazon2023-arm64 variants only
- Kernel: 6.1.x aarch64 (also seen on 6.12.x)
- Filesystem: XFS on NVMe ephemeral storage, symlinked from /tmp
- Failure rate: Intermittent (<1% of test runs on AL2023 arm64)
Related Tickets
- BF-39769 — unix_socket.js fileExists() returns false
- BF-40680 — Fatal assertion 40487 in setUnixDomainSocketPermissions (Closed)
- BF-40981 — host_connection_string_validation.js failure (Waiting for bug fix on this ticket)
- related to: SERVER-112674 Open and listen for connections on maintenance port (Closed)