[SERVER-70974] Fix early-exits triggered when user specifies TCP Fast Open server parameters Created: 31/Oct/22 Updated: 30/Nov/23 Resolved: 01/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0, 7.0.5, 6.0.13 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Blake Oler | Assignee: | Billy Donahue |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Service Arch
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v7.0, v6.0
|
||||||||
| Sprint: | Service Arch 2022-12-26, Service Arch 2022-11-28, Service Arch 2022-12-12, Service Arch 2023-01-09, Service Arch 2023-01-23, Service Arch 2023-02-06, Service Arch 2023-06-12, Service Arch 2023-06-26, Service Arch 2023-07-10, Service Arch 2023-07-24, Service Arch 2023-08-07 | ||||||||
| Participants: | |||||||||
| Case: | (copied to CRM) | ||||||||
| Description |
Background/ProblemAs part of implementing TCP Fast Open support, we calculate a variable that indicates whether the user specified any TCP Fast Open server parameters. We use this variable in some places to decide whether to emit logs about success/failure enabling TCP Fast Open. We also intend to use this to decide whether to fail from connection establishment. However, we forget to check whether the given error code indicates success or failure – the consequence is that whenever a user specifies any TCP Fast Open parameter, whether to turn it on, turn it off, or change the queue size, we will always fail connection establishment. In some cases, this can trigger a crash, as seen in the linked HELP ticket. In other cases, this just permanently stalls the server, because we never establish the listener thread that would accept connections. Relatedly, we always enable TCP Fast Open, even if the user attempts to disable it via server parameter. Solution
|
| Comments |
| Comment by Githook User [ 29/Nov/23 ] | |||||||||||||||||||||||
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: | |||||||||||||||||||||||
| Comment by Githook User [ 29/Nov/23 ] | |||||||||||||||||||||||
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: | |||||||||||||||||||||||
| Comment by Billy Donahue [ 07/Aug/23 ] | |||||||||||||||||||||||
|
Going to backport only the minimal changes to the transport layer. The main PR for this ticket was a VERY hard-won testability refactor and lots of platform juggling to get the tests to pass reliably, so I don't think it's good to backport all that extra stuff. The basic fix is just a few tweaks to a few if conditions. | |||||||||||||||||||||||
| Comment by Githook User [ 02/Aug/23 ] | |||||||||||||||||||||||
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: | |||||||||||||||||||||||
| Comment by Githook User [ 01/Aug/23 ] | |||||||||||||||||||||||
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: | |||||||||||||||||||||||
| Comment by Billy Donahue [ 31/Jul/23 ] | |||||||||||||||||||||||
|
merging origin/master in revealed some issues, since new tests disabled TFO the old-fashioned way (which doesn't work anymore). Took care of that. Windows hits a segfault tearing down one of the new tests. Not sure what that's about. | |||||||||||||||||||||||
| Comment by Billy Donahue [ 18/Jul/23 ] | |||||||||||||||||||||||
|
Still some evergreen troubles. Working on it. | |||||||||||||||||||||||
| Comment by Billy Donahue [ 14/Jul/23 ] | |||||||||||||||||||||||
|
As long as that journey was, it was not the end. The problem was ALSO a server-side problem. The listen call always specifies a backlog depth. In the jstest (mongod) this is 128.
I'm partially blaming the obscurity of IDL and server parameter initialization for this.
But it has a "fake" initial value of 0 at its C++ definition, for a reason I don't understand:
(what does "real default" mean?) So somehow the mocked environment of the transport_test doesn't go through all the same initialization steps of fetching the REAL default from IDL, and sticks with the C++ struct default or whatever, so we end up with a listen backlog of 0. This is not an error but it disables TFO. This doesn't actually break anything UNTIL you try to use TFO. TFO requires backlog capacity because the connection isn't fully TCP OPEN until it's been replied to, and it isn't replied to until you sendmsg a response, so a TFO connection is using a listen backlog slot until we read from it and send a reply. If there's no backlog capacity, TFO can't be engaged and the server will simply accept the connection earlier (before recvmsg or sendmsg happens) and reply to the SYN with a SYNACK acking none of the data in the TFO SYN packet.
PHWEW! | |||||||||||||||||||||||
| Comment by Billy Donahue [ 13/Jul/23 ] | |||||||||||||||||||||||
|
Was having a great deal of trouble testing TFO usage. Jstest did it, but C++ test didn't. Strace logs: tcp_fastopen_test.js performs a successful TFO connection.
asio_transport_layer_test performs a TCP connection, failing TFO:
Finally seeing these side by side, the ioctl call to select nonblocking IO before the connect (but then deselecting nonblocking IO immediately after the connect) is a smoking gun. |