[SERVER-61490] transport_layer_test: asio connect race Created: 15/Nov/21  Updated: 29/Oct/23  Resolved: 19/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.2.0, 5.0.7

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-61016 Swallow connection reset-related erro... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Steps To Reproduce:

seems to happen on Evergreen RHEL8.0. Not reliably repro'd though.

Sprint: Service Arch 2021-11-22
Participants:
Linked BF Score: 162

 Description   

SERVER-61016 introduced a new egress connection test. The connect call can hang at a failpoint somehow on some hosts.

This is difficult to reproduce in a spawnhost but EVG master builds seem to find it regularly.

e.g. https://evergreen.mongodb.com/task/mongodb_mongo_master_linux_64_debug_required_run_unittests_0d48c5f4d89b622af4efebee16b375378296a4d8_21_11_12_19_58_49



 Comments   
Comment by Githook User [ 18/Mar/22 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-61490 disable TCPFastOpen during egress test

(cherry picked from commit 0cae914207ee666c9c3e1d87dd8492f2749b3bf7)
Branch: v5.0
https://github.com/mongodb/mongo/commit/21a55485b4ce6bde1437a3638cc1385881a56abb

Comment by Githook User [ 19/Nov/21 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-61490 disable TCPFastOpen during egress test
Branch: master
https://github.com/mongodb/mongo/commit/0cae914207ee666c9c3e1d87dd8492f2749b3bf7

Comment by Billy Donahue [ 18/Nov/21 ]

Under TFO, the client doesn't send any packets at all to the server until it has data to write.
This is conditionally controlled by whether the client has a cookie for that server yet or not, which is something the TCP implementaion determines on its own, which is why it appears to be so flaky and unreproducible. So in that case there's nothing to wake up the server, and the test hangs. Disable TFO for this test.

Comment by Billy Donahue [ 18/Nov/21 ]

Code Review: https://github.com/10gen/mongo/pull/1984

Comment by Billy Donahue [ 16/Nov/21 ]

A-friggin-HA! I'll bet it's related to the TCP_FASTOPEN_CONNECT support and an option set by transport_layer_asio set on the egress sockets. Under TFO, a connect call might be considered complete once the SYN is sent (it doesn't wait for a response from the peer). This would explain why it never happens on Mac, Windows, or some Linux versions. It also perhaps partially explains why other TransportLayerASIO tests don't run into this problem: they are all creating vanilla client sockets that don't have the TCP_FASTOPEN_CONNECT option set.

The behavior of whether connect waits or not is dynamically determined, which explains perhaps why Sam's reboot caused his machine to stop reproducing the hang. The dynamics involve whether the client has an unexpired TFO "cookie" for the peer's IP address. If it has a cookie, it can send that along and optimize the connect handshake. If not, it has to wait for an ACK.

Still working out the details... But this must be related to what's going on somehow.

If the connect returns early, ASIOSession's ctor will hit the failpoint and block. The accept callback never happens, because accept only returns when the TCP state machine is ESTABLISHED, which means the client responded to SYN,ACK with an ACK. This never happens, because the client side doesn't react to the SYN,ACK for some reason. I don't know why that would be, but maybe it's related to the connecting thread being stuck at a failpoint, but this doesn't quite make sense to me as I thought this negotiation is not handled in user space. Gotta dig into that.

Comment by Billy Donahue [ 16/Nov/21 ]

Bad news! Using an io_context::post to wait for the context to be up and running before trying to connect seems to have not worked. This test still hangs.

https://spruce.mongodb.com/task/mongodb_mongo_master_linux_64_debug_required_run_unittests_patch_d9fcd9f124ece9ab0b3a3c46cb6d7052b7282dd2_6192ec9a3e8e8661000b5dbf_21_11_15_23_27_06/tests?execution=0&sortBy=STATUS&sortDir=ASC

Need a new hypothesis here!

Comment by Billy Donahue [ 15/Nov/21 ]

Update on this.

I Paired up with amirsaman.memaripour for a couple of hours. We're both using the same EXACT libc, kernel, and transport_layer static executable on our respective virtual workstations. His machine consistently shows the hang, and mine does not. We checked all ldd -dependencies with shasum to verify the matches. He shipped me his transport_layer static executable over GDrive.

During the hang, Sam could run a netstat and see that the connection was in the SYN_SENT TCP state, which means the connect started but was unacknowledged by the server (which is local to the test and running on its own completely separate io_context and thread). The server's async_accept never happens.

After rebooting his workstation, he was unable to repro!

So the first thing I saw (above in the description) was that a spawnhost equivalent to a failing buildhost could not repro even when using the same executable as the failing EVG task. That's strange. But it's a different machine. But now it looks like even the same machine after a reboot (Sam's virtual workstation) can't repro. We really go from consistent failure to consistent success. It's one of the weirdest things ever.

Clearly there's some race in the timing of this test's tla.connect call and the acceptor's asio socket's async_accept. Maybe the connect is happening "too early" and ASIO misses it somehow.
We came up with 2 experiments to try:

  • In the test, post a task to the io_context and wait for it to finish. Do this before trying to connect, so we know the async_accept is queued up and processing events.
  • Try rewriting the acceptor in POSIX socket API instead of ASIO. This is more work (supporting Windows, etc) but it might answer the question of whether the problem is ASIO or libc/Linux.
Generated at Thu Feb 08 05:52:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.