[SERVER-76103] hang in transport_test / EgressAsioNetworkingBatonTest / AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton waiting to reach failpoint Created: 14/Apr/23 Updated: 29/Oct/23 Resolved: 17/May/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0, 7.0.0-rc6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Billy Donahue | Assignee: | Billy Donahue |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v7.0 |
| Sprint: | Service Arch 2023-04-17, Service Arch 2023-05-01, Service Arch 2023-05-15, Service Arch 2023-05-29 |
| Participants: | |
| Linked BF Score: | 7 |
| Description |
|
This causes asio_transport_layer_test.cpp cases to hang indefinitely. Unfortunately, the spawnhost (rhel81) machine's runtime environment seems capable of going through a full day of providing a reliable repro for this, and then without warning the same steps become a non-repro. Very odd. After a discussion with Matt Diener about opportunisticRead, we may have some idea of how this can happen, and it may be a bug in that function, whereby the same input bytes can appear more than once in the data stream. If that's true and there's a real bug that can cause input data to be delivered twice to a Session, we need to investigate and fix that. |
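For context, here is a minimal sketch of the "opportunistic read" pattern under discussion. This is not the actual mongo::transport implementation; it assumes standalone asio, a socket left in non-blocking mode, and an illustrative function name. It only shows where a duplicate-delivery bug of the kind speculated about above could creep in: if the async continuation re-read from the start of the buffer instead of from the offset already consumed by the speculative synchronous read.

```cpp
#include <asio.hpp>

#include <cstddef>
#include <system_error>
#include <utility>

// Hypothetical sketch, not the real opportunisticRead: speculative synchronous
// read first, then an async read for whatever is still missing.
template <typename Stream, typename Handler>
void opportunisticReadSketch(Stream& stream, asio::mutable_buffer buffer, Handler handler) {
    std::error_code ec;
    // With the socket in non-blocking mode this returns whatever bytes are
    // immediately available (possibly zero) instead of blocking.
    std::size_t bytesRead = asio::read(stream, buffer, ec);
    if (ec == asio::error::would_block || ec == asio::error::try_again) {
        // Not everything was available yet: finish with an async read of the
        // *remainder*. Passing `buffer` here instead of `buffer + bytesRead`
        // would hand the first `bytesRead` bytes to the caller twice, i.e. the
        // duplicate-delivery bug described in the ticket.
        asio::async_read(stream, buffer + bytesRead, std::move(handler));
    } else {
        // Buffer filled synchronously, or a hard error (e.g. eof) occurred.
        handler(ec, bytesRead);
    }
}
```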
| Comments |
| Comment by Githook User [ 24/Jun/23 ] |
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}
Message: (cherry picked from commit d8d82fa014c1a5690912ed3ceab1d72d4a973005) |
| Comment by Githook User [ 17/May/23 ] |
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}
Message: |
| Comment by Billy Donahue [ 15/May/23 ] |
|
The spawnhost on which I've been doing this investigation has been destroyed by Evergreen after 30 days. |
| Comment by Billy Donahue [ 15/May/23 ] |
|
The problem is AsioNetworkingBatonTest::FirstSessionSEP, which doesn't belong in this test. The design of that class is to accept one connection and fill a promise with it.
It takes time for this session to die, and that happens on the listener thread. The test, meanwhile, is trying a polled asyncSourceMessage on the peer of this session immediately after it's set up.
If this test's local thread reaches this asyncSourceMessage call quickly enough, the socket peer will not yet have been killed by the listener thread. There will be no events on the session's socket as it is polled, and opportunisticRead's initial asio::read of that socket will return a successful but empty result, so the operation falls through to the async path and the test can reach the FailPoint it is waiting on. If the listener manages to run faster than the test and kill the incoming session quickly, then the session's peer, the test's local thread, will perform this initial asio::read after the socket is closed, and will therefore exit early and not reach the FailPoint upon which the test relies. |
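To make the two interleavings concrete, here is a small standalone toy over loopback TCP sockets (assuming standalone asio; this is not the test itself). An idle-but-open peer yields would_block on the first read, so nothing is lost and the polled path can proceed, while an already-shut-down peer yields asio::error::eof immediately, which is the early-exit path described above.

```cpp
#include <asio.hpp>

#include <iostream>
#include <system_error>

int main() {
    asio::io_context io;
    // Listener and a client connection over loopback, standing in for the
    // test's listener thread and the session's peer socket.
    asio::ip::tcp::acceptor acceptor(io, {asio::ip::make_address("127.0.0.1"), 0});
    asio::ip::tcp::socket client(io), server(io);
    client.connect(acceptor.local_endpoint());
    acceptor.accept(server);
    server.non_blocking(true);

    char buf[64];
    std::error_code ec;

    // Case 1: peer connected but silent. The non-blocking read reports
    // would_block; there are simply no events on the socket yet.
    server.read_some(asio::buffer(buf), ec);
    std::cout << "peer idle:   " << ec.message() << "\n";

    // Case 2: the peer side is shut down, as happens when the accepted
    // Session is destroyed on the listener thread. The very next read
    // reports asio::error::eof.
    client.shutdown(asio::ip::tcp::socket::shutdown_send);
    server.read_some(asio::buffer(buf), ec);
    std::cout << "peer closed: " << ec.message() << "\n";
    return 0;
}
```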
| Comment by Billy Donahue [ 15/May/23 ] |
|
OK, strace -k gives stack traces...
The acceptCb lambda body inside _acceptConnection is unpinning a shared_ptr holding a mongo::transport::SyncAsioSession, and the destructor causes a shutdown of the socket. |
| Comment by Billy Donahue [ 15/May/23 ] |
|
The connection in question is fd 21, made from thread3 to the listener thread, which accepts it as fd 22. Filtered and annotated strace log excerpt.
I don't yet know why the listener thread shuts the socket down. |
| Comment by Billy Donahue [ 08/May/23 ] |
|
Confirmed that rebooting the spawnhost resets it into a state in which the repro happens again! |
| Comment by Billy Donahue [ 20/Apr/23 ] |
|
Ah, GDB tracing shows that the HostUnreachable is coming up in the hung test, so this IS correlated with the unexpected exceptions in the original BF. It's the result of an asio::error::eof from the first asio::read of the stream. Our errorCodeToStatus function translates the error_code asio::error::eof to HostUnreachable for some reason I don't remember. I think it was chosen because it's retryable, not because the host was unreachable.
Anyway, there's an asio::error::eof coming up through the first asio::read of the socket for some reason. Hypothesis time: the fact that this happens on fresh hosts for a while and then can't be reproduced for a long time is weird. I wonder if this is related to TCP Fast Open, which has behavior that depends on kernel-cached cookies outside the process. The eof has been traced to a recvmsg call (with flags=0) returning 0, which indicates the peer performed an orderly shutdown (https://linux.die.net/man/2/recvmsg). So there's perhaps some race in this test's unusual structure causing this socket to be closed unexpectedly. More digging needed. |
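As an illustration only (this is not the actual mongo::transport::errorCodeToStatus, and the Status/ErrorCodes types below are stand-ins), the mapping being described looks roughly like this: an asio::error::eof from the first read gets folded into a retryable HostUnreachable-style Status rather than reported as a plain end-of-stream.

```cpp
#include <asio.hpp>

#include <string>
#include <system_error>

// Stand-in types for illustration; the real server has its own Status/ErrorCodes.
enum class ErrorCodes { OK, HostUnreachable, SocketException };

struct Status {
    ErrorCodes code;
    std::string reason;
};

// Hypothetical sketch of the eof-to-HostUnreachable translation described above.
Status errorCodeToStatusSketch(const std::error_code& ec) {
    if (!ec)
        return {ErrorCodes::OK, ""};
    if (ec == asio::error::eof) {
        // Peer closed the connection; mapped to a retryable code, not because
        // the host was literally unreachable.
        return {ErrorCodes::HostUnreachable, "Connection closed by peer: " + ec.message()};
    }
    return {ErrorCodes::SocketException, ec.message()};
}
```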
| Comment by Billy Donahue [ 19/Apr/23 ] |