- Type: Bug
- Resolution: Unresolved
- Priority: Critical - P2
- Affects Version/s: 6.17.0, 6.18.0, 6.19.0, 6.20.0, 7.0.0, 6.21.0, 7.1.0
- Component/s: CMAP, Connection Layer
Customer report: https://jira.mongodb.org/browse/HELP-89527
Environment: AWS Lambda (Node 20.x, 15-minute timeout), MongoDB Data Federation (Atlas) via AWS PrivateLink. Long-running `$out` aggregation into S3.
Timelines
6.15 (6.16 behaves the same):
- CommandStarted, aggregation
- Data Federation completes in ~6 min
- CommandSucceeded, aggregation
6.17:
- CommandStarted, aggregation
- Data Federation completes in ~6 min
- ...
- Lambda timeout (15 min)
What changed (https://github.com/mongodb/node-mongodb-native/pull/4510/changes#diff-7cc25e5d3247913cdf1fe1e7788e951e4435bb1091d6b8aef135c4b171c7c997):
Before 6.17, makeSocket() explicitly called setKeepAlive and setNoDelay on every socket after creation:
socket.setKeepAlive(true, 300000);
socket.setTimeout(connectTimeoutMS);
socket.setNoDelay(noDelay);
After the change (6.17 and onwards):
result.keepAliveInitialDelay ??= 120000;
result.keepAlive = true;
result.noDelay = options.noDelay ?? true;
socket.setTimeout(connectTimeoutMS);
// both calls have been removed
// socket.setKeepAlive(true, 300000);
// socket.setNoDelay(noDelay);
tls.connect() does not apply keepAlive and keepAliveInitialDelay when they are passed as constructor options; these options are silently ignored (https://github.com/nodejs/node/issues/62003). As a result, TLS sockets created this way end up without TCP keepalive.
The problem surfaces with AWS PrivateLink because it enforces a 350-second idle connection timeout (https://aws.amazon.com/blogs/networking-and-content-delivery/implementing-long-running-tcp-connections-within-vpc-networking/). Without keepalive, no data flows on the connection during the long-running server-side operation (the ~6-minute aggregation), so the NLB drops the TCP connection. The server then sends its response into a dead connection, and the client socket has no way to learn that the connection is gone.
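The timing argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: `connectionSurvives` and the constant names are hypothetical, the numbers are the ones quoted in this ticket, and the model is deliberately simplified (it assumes a keepalive probe resets the idle timer and only looks at the first probe, not at the probe interval afterwards).

```typescript
// Values taken from the ticket, in milliseconds.
const NLB_IDLE_TIMEOUT_MS = 350_000; // PrivateLink idle connection timeout
const QUIET_PERIOD_MS = 6 * 60 * 1000; // ~6 min aggregation with no traffic

// Hypothetical helper: does an otherwise silent connection outlive the idle timeout?
// `keepAliveInitialDelayMs` is undefined when keepalive is not enabled on the socket.
function connectionSurvives(
  quietPeriodMs: number,
  idleTimeoutMs: number,
  keepAliveInitialDelayMs?: number
): boolean {
  // A quiet period shorter than the idle timeout never triggers the drop.
  if (quietPeriodMs < idleTimeoutMs) return true;
  // Otherwise the first keepalive probe has to go out before the timeout fires.
  return keepAliveInitialDelayMs !== undefined && keepAliveInitialDelayMs < idleTimeoutMs;
}

const scenarios: Array<{ label: string; delay: number | undefined }> = [
  { label: 'before 6.17: setKeepAlive(true, 300000)', delay: 300_000 },
  { label: '6.17+ over TLS: keepalive not applied', delay: undefined },
  { label: 'proposed fix: setKeepAlive(true, 120000)', delay: 120_000 }
];

for (const { label, delay } of scenarios) {
  console.log(label, '->', connectionSurvives(QUIET_PERIOD_MS, NLB_IDLE_TIMEOUT_MS, delay));
}
```

Under these assumptions only the middle scenario loses the connection, which matches the 6.15 and 6.17 timelines above.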
Fix: we should restore explicit setKeepAlive and setNoDelay calls in makeSocket() after socket creation.
socket.setKeepAlive(true, options.keepAliveInitialDelay ?? 120_000);
socket.setTimeout(connectTimeoutMS);
socket.setNoDelay(options.noDelay ?? true);
There is no easy way to write an integration test for this, since it depends on the NLB dropping the TCP connection. A new unit test should verify that setKeepAlive and setNoDelay are called on the sockets produced by makeSocket for both TLS and non-TLS connections.
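One possible shape for that unit test is sketched below. It is not the driver's test code: `applySocketTuning` is a hypothetical stand-in for the post-creation step in makeSocket(), `SocketTuningOptions` is a reduced, assumed option shape, and `RecordingSocket` is a test double used in place of real net/tls sockets so that nothing touches the network. A real test would more likely spy on the sockets returned by makeSocket itself.

```typescript
// Minimal structural view of the socket methods the fix relies on.
// Both net.Socket and tls.TLSSocket provide these methods.
interface TunableSocket {
  setKeepAlive(enable: boolean, initialDelay: number): unknown;
  setNoDelay(noDelay: boolean): unknown;
  setTimeout(timeout: number): unknown;
}

// Hypothetical, reduced option shape (assumption for this sketch).
interface SocketTuningOptions {
  keepAliveInitialDelay?: number;
  noDelay?: boolean;
  connectTimeoutMS: number;
}

// Stand-in for the explicit calls proposed in the ticket.
function applySocketTuning(socket: TunableSocket, options: SocketTuningOptions): void {
  socket.setKeepAlive(true, options.keepAliveInitialDelay ?? 120_000);
  socket.setTimeout(options.connectTimeoutMS);
  socket.setNoDelay(options.noDelay ?? true);
}

// Test double that records every call in order.
class RecordingSocket implements TunableSocket {
  readonly calls: Array<{ method: string; args: unknown[] }> = [];
  setKeepAlive(enable: boolean, initialDelay: number): this {
    this.calls.push({ method: 'setKeepAlive', args: [enable, initialDelay] });
    return this;
  }
  setNoDelay(noDelay: boolean): this {
    this.calls.push({ method: 'setNoDelay', args: [noDelay] });
    return this;
  }
  setTimeout(timeout: number): this {
    this.calls.push({ method: 'setTimeout', args: [timeout] });
    return this;
  }
}

// One double per transport; the assertion is the same for TLS and non-TLS.
const recorded: Record<string, RecordingSocket> = {
  tls: new RecordingSocket(),
  plain: new RecordingSocket()
};

for (const [transport, socket] of Object.entries(recorded)) {
  applySocketTuning(socket, { connectTimeoutMS: 30_000 });
  const methods = socket.calls.map(call => call.method);
  const ok = methods.includes('setKeepAlive') && methods.includes('setNoDelay');
  console.log(`${transport}: keepalive and noDelay applied = ${ok}`);
}
```

Asserting on the recorded arguments as well as the method names would also catch a regression in the default delay.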