[SERVER-70974] Fix early-exits triggered when user specifies TCP Fast Open server parameters Created: 31/Oct/22  Updated: 30/Nov/23  Resolved: 01/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.5, 6.0.13

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.0
Sprint: Service Arch 2022-12-26, Service Arch 2022-11-28, Service Arch 2022-12-12, Service Arch 2023-01-09, Service Arch 2023-01-23, Service Arch 2023-02-06, Service Arch 2023-06-12, Service Arch 2023-06-26, Service Arch 2023-07-10, Service Arch 2023-07-24, Service Arch 2023-08-07
Participants:
Case:

 Description   

Background/Problem

As part of implementing TCP Fast Open support, we calculate a variable that indicates whether the user specified any TCP Fast Open server parameters.

We use this variable in some places to decide whether to emit logs about success/failure enabling TCP Fast Open. We also intend to use this to decide whether to fail from connection establishment.

However, we forget to check whether the given error code indicates success or failure – the consequence is that whenever a user specifies any TCP Fast Open parameter, whether to turn it on, turn it off, or change the queue size, we will always fail connection establishment.

In some cases, this can trigger a crash, as seen in the linked HELP ticket. In other cases, this just permanently stalls the server, because we never establish the listener thread that would accept connections.

Relatedly, we always enable TCP Fast Open, even if the user attempts to disable it via server parameter.

Solution

  • Only return early from connection establishment if the error code is a failure.
  • Add logic only to enable TCP Fast Open if either the user doesn't specify (implicit enablement) or if the user specifies to enable TCP Fast Open. Do not enable it if the user indicates not to enable TCP Fast Open.
  • Add tests to ensure that the behavior has been fixed.


 Comments   
Comment by Githook User [ 29/Nov/23 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-70974 Fix early-exits triggered when user specifies TFO
Branch: v7.0
https://github.com/mongodb/mongo/commit/4cfb9179e4e7241852c7c2d6dffffaa7d271ab24

Comment by Githook User [ 29/Nov/23 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-70974 Fix early-exits triggered when user specifies TCP Fast Open server parameters
Branch: v6.0
https://github.com/mongodb/mongo/commit/ac2ac3fef3d9d0d14b996180b19ec70f986d6115

Comment by Billy Donahue [ 07/Aug/23 ]

Going to backport only the minimal changes to the transport layer.

The main PR for this ticket was a VERY hard-won testability refactor and lots of platform juggling to get the tests to pass reliably, so I don't think it's good to backport all that extra stuff. The basic fix is just a few tweaks to a few if conditions.

Comment by Githook User [ 02/Aug/23 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-70974 Fix early-exits triggered when user specifies TCP Fast Open server parameters
Branch: minh.luu-no_compile_sys-perf
https://github.com/mongodb/mongo/commit/16cabd2354977e9e1d2cf4a6a89416cad240ee76

Comment by Githook User [ 01/Aug/23 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-70974 Fix early-exits triggered when user specifies TCP Fast Open server parameters
Branch: master
https://github.com/mongodb/mongo/commit/16cabd2354977e9e1d2cf4a6a89416cad240ee76

Comment by Billy Donahue [ 31/Jul/23 ]

merging origin/master in revealed some issues, since new tests disabled TFO the old-fashioned way (which doesn't work anymore). Took care of that.

Windows hits a segfault tearing down one of the new tests. Not sure what that's about.
Trying to revert some small details to fix this problem. Not sure what it is yet.
https://spruce.mongodb.com/task/mongodb_mongo_master_windows_compile_required_run_unittests_patch_71a4c41f6e33cabc882b3442df027490eda7a900_64c81a48d6d80a5ccb0288db_23_07_31_20_32_45/tests?execution=0&sortBy=STATUS&sortDir=ASC

Comment by Billy Donahue [ 18/Jul/23 ]

Still some evergreen troubles. Working on it.

Comment by Billy Donahue [ 14/Jul/23 ]

As long as that journey was, it was not the end.
Changing to nonblocking IO client connect calls didn't fix the problem.

The problem was ALSO a server-side problem.

The listen call always specifies a backlog depth. In the jstest (mongod) this is 128.
In the C++ unit test it's 0. Termporary fix:

@ src/mongo/transport/asio/asio_transport_layer.cpp:1163 @ void AsioTransportLayer::_runListener() noexcept {
    for (auto& acceptorRecord : _acceptorRecords) {
        asio::error_code ec;
-       acceptorRecord->acceptor.listen(serverGlobalParams.listenBacklog, ec);
+       int backlog = serverGlobalParams.listenBacklog;
+       if (backlog == 0)
+           backlog = SOMAXCONN;
+       acceptorRecord->acceptor.listen(backlog, ec);

I'm partially blaming the obscurity of IDL and server parameter initialization for this.
The default value of this listenBacklog is not defined in C++. It's a C++ expression hiding in yaml.

   'net.listenBacklog':
        description: 'Set socket listen backlog size'
        short_name: listenBacklog
        arg_vartype: Int
        default: { expr: 'SOMAXCONN' }

But it has a "fake" initial value of 0 at its C++ definition, for a reason I don't understand:

    int listenBacklog = 0;  // --listenBacklog, real default is SOMAXCONN

(what does "real default" mean?)

So somehow the mocked environment of the transport_test doesn't go through all the same initialization steps of fetching the REAL default from IDL, and sticks with the C++ struct default or whatever, so we end up with a listen backlog of 0. This is not an error but it disables TFO. This doesn't actually break anything UNTIL you try to use TFO. TFO requires backlog capacity because the connection isn't fully TCP OPEN until it's been replied to, and it isn't replied to until you sendmsg a response, so a TFO connection is using a listen backlog slot until we read from it and send a reply.

If there's no backlog capacity, TFO can't be engaged and the server will simply accept the connection earlier (before recvmsg or sendmsg happens) and reply to the SYN with a SYNACK acking none of the data in the TFO SYN packet.

This is a failed TFO attempt because listen backlog was full (0 actually).
02:53:36.888552 IP localhost.54482 > localhost.37097: Flags [S], seq 423420387:423420427, win 65495, options [mss 65495,sackOK,TS val 4149143320 ecr 0,nop,wscale 7,tfo  cookie ed87c9091b811d74,nop,nop], length 40
 
// Only acks 423420388, which is the SYN sequence number +1, a normal TCP open.
02:53:36.888562 IP localhost.37097 > localhost.54482: Flags [S.], seq 1804626240, ack 423420388, win 65483, options [mss 65495,sackOK,TS val 4149143312 ecr 4149143320,nop,wscale 7], length 0
 
// The 40 payload bytes are not acked and must be retransmitted here:
02:53:36.888568 IP localhost.54482 > localhost.37097: Flags [P.], seq 1:41, ack 1, win 512, options [nop,nop,TS val 4149143320 ecr 4149143312], length 40
02:53:36.888575 IP localhost.37097 > localhost.54482: Flags [.], ack 41, win 502, options [nop,nop,TS val 4149143320 ecr 4149143320], length 0

PHWEW!

Comment by Billy Donahue [ 13/Jul/23 ]

Was having a great deal of trouble testing TFO usage. Jstest did it, but C++ test didn't.

Strace logs:

tcp_fastopen_test.js performs a successful TFO connection.

13310 16:29:53 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 14
13310 16:29:53 setsockopt(14, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
13310 16:29:53 ioctl(14, FIONBIO, [1])  = 0
13310 16:29:53 connect(14, {sa_family=AF_INET, sin_port=htons(20040), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
13310 16:29:53 ioctl(14, FIONBIO, [0])  = 0
13310 16:29:53 getsockname(14, {sa_family=AF_INET, sin_port=htons(50038), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 0
13310 16:29:53 setsockopt(14, SOL_TCP, TCP_NODELAY, [1], 4) = 0
13310 16:29:53 setsockopt(14, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
13310 16:29:53 getsockopt(14, SOL_TCP, TCP_KEEPIDLE, [7200], [4]) = 0
13310 16:29:53 setsockopt(14, SOL_TCP, TCP_KEEPIDLE, [300], 4) = 0
13310 16:29:53 getsockopt(14, SOL_TCP, TCP_KEEPINTVL, [75], [4]) = 0
13310 16:29:53 setsockopt(14, SOL_TCP, TCP_KEEPINTVL, [1], 4) = 0
13310 16:29:53 getsockname(14, {sa_family=AF_INET, sin_port=htons(50038), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 0
13310 16:29:53 ioctl(14, FIONBIO, [0])  = 0
13310 16:29:53 sendmsg(14, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\r\1\0\0\21\0\0\0\0\0\0\0\335\7\0\0\1\0\0\0\0\364\0\0\0\20hello\0"..., iov_len=269}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL <unfinished ...>

asio_transport_layer_test performs a TCP connection, failing TFO:

4028  20:14:51 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 16
4028  20:14:51 setsockopt(16, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
4028  20:14:51 connect(16, {sa_family=AF_INET, sin_port=htons(43019), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
4028  20:14:51 sendmsg(16, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="(\0\0\0\0\0\0\0\0\0\0\0\335\7\0\0\1\0\0\0\0\17\0\0\0\20ping\0\1"..., iov_len=40}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 40

Finally seeing these side by side, the ioctl call to select nonblocking IO before the connect (but then deselecting nonblocking IO immediately after the connect) is a smoking gun.

Generated at Thu Feb 08 06:17:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.