[SERVER-61490] transport_layer_test: asio connect race Created: 15/Nov/21 Updated: 29/Oct/23 Resolved: 19/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.2.0, 5.0.7 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Billy Donahue | Assignee: | Billy Donahue |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||||||
| Steps To Reproduce: | seems to happen on Evergreen RHEL8.0. Not reliably repro'd though. |
||||||||||||||||
| Sprint: | Service Arch 2021-11-22 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 162 | ||||||||||||||||
| Description |
|
This is difficult to reproduce in a spawnhost but EVG master builds seem to find it regularly. |
| Comments |
| Comment by Githook User [ 18/Mar/22 ] |
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: (cherry picked from commit 0cae914207ee666c9c3e1d87dd8492f2749b3bf7) |
| Comment by Githook User [ 19/Nov/21 ] |
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: |
| Comment by Billy Donahue [ 18/Nov/21 ] |
|
Under TFO, the client doesn't send any packets at all to the server until it has data to write. |
| Comment by Billy Donahue [ 18/Nov/21 ] |
|
Code Review: https://github.com/10gen/mongo/pull/1984 |
| Comment by Billy Donahue [ 16/Nov/21 ] |
|
A-friggin-HA! I'll bet it's related to the TCP_FASTOPEN_CONNECT support and an option set by transport_layer_asio set on the egress sockets. Under TFO, a connect call might be considered complete once the SYN is sent (it doesn't wait for a response from the peer). This would explain why it never happens on Mac, Windows, or some Linux versions. It also perhaps partially explains why other TransportLayerASIO tests don't run into this problem: they are all creating vanilla client sockets that don't have the TCP_FASTOPEN_CONNECT option set. The behavior of whether connect waits or not is dynamically determined, which explains perhaps why Sam's reboot caused his machine to stop reproducing the hang. The dynamics involve whether the client has an unexpired TFO "cookie" for the peer's IP address. If it has a cookie, it can send that along and optimize the connect handshake. If not, it has to wait for an ACK. Still working out the details... But this must be related to what's going on somehow. If the connect returns early, ASIOSession's ctor will hit the failpoint and block. The accept callback never happens, because accept only returns when the TCP state machine is ESTABLISHED, which means the client responded to SYN,ACK with an ACK. This never happens, because the client side doesn't react to the SYN,ACK for some reason. I don't know why that would be, but maybe it's related to the connecting thread being stuck at a failpoint, but this doesn't quite make sense to me as I thought this negotiation is not handled in user space. Gotta dig into that. |
| Comment by Billy Donahue [ 16/Nov/21 ] |
|
Bad news! Using an io_context::post to wait for the context to be up and running before trying to connect seems to have not worked. This test still hangs. Need a new hypothesis here! |
| Comment by Billy Donahue [ 15/Nov/21 ] |
|
Update on this. I Paired up with amirsaman.memaripour for a couple of hours. We're both using the same EXACT libc, kernel, and transport_layer static executable on our respective virtual workstations. His machine consistently shows the hang, and mine does not. We checked all ldd -dependencies with shasum to verify the matches. He shipped me his transport_layer static executable over GDrive. During the hang, Sam could run a netstat and see that the connection was in the SYN_SENT TCP state, which means the connect started but was unacknowledged by the server (which is local to the test and running on its own completely separate io_context and thread). The server's async_accept never happens. After rebooting his workstation, he was unable to repro! So the first thing I saw (above in the description) was that a spawnhost equivalent to a failing buildhost could not repro even when using the same executable as the failing EVG task. That's strange. But it's a different machine. But now it looks like even the same machine after a reboot (Sam's virtual workstation) can't repro. We really go from consistent failure to consistent success. It's one of the weirdest things ever. Clearly there's some race in the timing of this test's tla.connect call and the acceptor's asio socket's async_accept. Maybe the connect is happening "too early" and ASIO misses it somehow.
|