[SERVER-76103] hang in transport_test / EgressAsioNetworkingBatonTest / AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton waiting to reach failpoint Created: 14/Apr/23  Updated: 29/Oct/23  Resolved: 17/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc6

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-75876 HostUnreachable in transport_test / E... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Sprint: Service Arch 2023-04-17, Service Arch 2023-05-01, Service Arch 2023-05-15, Service Arch 2023-05-29
Participants:
Linked BF Score: 7

 Description   

This causes asio_transport_layer_test.cpp cases to hang indefinitely.
The failure in BFG-1827646 is one such case.
This is hard to repro, as discussed in the closely related SERVER-75876.

While SERVER-75876 is about an unexpected HostUnreachable (in a prior test case in the same run) that was observed as a prelude to the hang, further investigation showed that the hang can occur without the HostUnreachable, and can occur in a single-test "--filter" run of AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton.

Unfortunately, the spawnhost (rhel81) machine's runtime environment can provide a reliable repro for a full day and then, without warning, the same steps stop reproducing the problem. Very odd.

After a discussion with Matt Diener about opportunisticRead, we may have an idea of how this can happen: a bug in that function whereby the same input bytes can appear more than once in the data stream.
That would mean the deferred async read (and the associated failpoint) never needs to execute in this unit test, which would explain the hang and timeouts.

If that's true and there's a real bug that can cause input data to be delivered twice to a Session, we need to investigate and fix that.
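
For illustration, here is a hedged sketch of the general shape of such a read path, written against plain POSIX sockets rather than MongoDB's actual opportunisticRead; the function name and the poll()-based fallback are stand-ins. It shows why, if the initial synchronous read already satisfies the request, the deferred path (where the test's failpoint sits) never executes:

    #include <cerrno>
    #include <cstddef>
    #include <poll.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    // Hypothetical sketch, not the real opportunisticRead: read `len` bytes from
    // `fd`, trying a non-blocking read first and deferring only if it would block.
    bool opportunisticReadSketch(int fd, char* buf, size_t len) {
        size_t got = 0;
        while (got < len) {
            // Opportunistic path: if the bytes are already queued in the kernel,
            // this satisfies the request and the deferred path below never runs.
            ssize_t n = ::recv(fd, buf + got, len - got, MSG_DONTWAIT);
            if (n > 0) {
                got += static_cast<size_t>(n);
                continue;
            }
            if (n == 0)
                return false;  // Peer performed an orderly shutdown (EOF).
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return false;  // Real error.

            // Deferred path: the unit test's failpoint sits on the equivalent of
            // this branch in the real code. If control never gets here, the
            // test's "waiting to reach failpoint" step hangs forever.
            struct pollfd pfd;
            pfd.fd = fd;
            pfd.events = POLLIN;
            pfd.revents = 0;
            if (::poll(&pfd, 1, -1) < 0)
                return false;
        }
        return true;
    }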



 Comments   
Comment by Githook User [ 24/Jun/23 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-76103 asio_transport_layer_test: keep sessions alive

(cherry picked from commit d8d82fa014c1a5690912ed3ceab1d72d4a973005)
Branch: v7.0
https://github.com/mongodb/mongo/commit/483fe588d3b8c710e168f9b2b94a33cc8d9f7ae9

Comment by Githook User [ 17/May/23 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-76103 asio_transport_layer_test: keep sessions alive
Branch: master
https://github.com/mongodb/mongo/commit/d8d82fa014c1a5690912ed3ceab1d72d4a973005

Comment by Billy Donahue [ 15/May/23 ]

The spawnhost where I've been working on this investigation has been destroyed by Evergreen after 30 days.
Further work will be done after recreating the workflow on another machine.

Comment by Billy Donahue [ 15/May/23 ]

The problem is AsioNetworkingBatonTest::FirstSessionSEP, which doesn't belong in this test.

The design of that class is to accept one connection and fill a promise with it.
In SERVER-71909, it was retrofitted to sometimes accept more connections, but insufficient consideration was given to what it should actually do with those accepted connections, so it merely lets each extra connection fall out of scope and die.

        void startSession(std::shared_ptr<transport::Session> session) override {
            if (!_isEmplaced) {
                _promise.emplaceValue(std::move(session));
                _isEmplaced = true;
                return;
            }
            invariant(static_cast<bool>(shouldAllowMultipleSessions),
                      "Created a second ingress Session when only one was expected.");
        }
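
For comparison, a hedged sketch of the direction the eventual fix takes (the commit for this ticket is titled "keep sessions alive"); the class name and the _extraSessions member below are illustrative, not the actual patch:

    #include <memory>
    #include <mutex>
    #include <vector>

    struct Session {};  // Stand-in for transport::Session.

    class KeepAliveSessionManager {  // Illustrative; not the real FirstSessionSEP.
    public:
        void startSession(std::shared_ptr<Session> session) {
            std::lock_guard<std::mutex> lk(_mutex);
            if (!_isEmplaced) {
                _firstSession = std::move(session);  // The real test fulfills a promise here.
                _isEmplaced = true;
                return;
            }
            // Retain additional ingress sessions instead of letting them fall out
            // of scope: destroying one on the listener thread shuts down its
            // socket and races with the test's use of the peer.
            _extraSessions.push_back(std::move(session));
        }

    private:
        std::mutex _mutex;
        bool _isEmplaced = false;
        std::shared_ptr<Session> _firstSession;
        std::vector<std::shared_ptr<Session>> _extraSessions;
    };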

It takes time for this session to die, and it happens on the listener thread.

The test, meanwhile, is trying a polled asyncSourceMessage on the peer of this session immediately after it's set up.
The listener thread races with the JoinThread created inside the test.

    JoinThread thread([&] {
        auto baton = opCtx->getBaton();
        ready.get();
        es.session()->asyncSourceMessage(baton).ignoreValue().getAsync(
            [session = es.session()](Status) {
                // Capturing `session` is necessary as parts of this continuation run at fixture
                // destruction.
            });
    });

If this test's local thread reaches this asyncSourceMessage call quickly enough, the socket peer will not yet have been killed by the listener thread. There will be no events on the session's socket as it is polled, and opportunisticRead's initial asio::read of that socket will return a successful but empty result.

If the listener manages to run faster than the test and kill the incoming session quickly, then the session's peer, the test's local thread, will perform this initial asio::read after the socket is closed, and will therefore exit early and not reach the FailPoint upon which the test relies.
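
A minimal, self-contained demonstration of the losing interleaving, using plain POSIX sockets rather than asio (the MongoDB types and the asio error mapping appear only in comments): once the "listener" side has shut its end down, the peer's first read returns 0 bytes, i.e. EOF, which asio surfaces as asio::error::eof, so the read path exits early and the FailPoint is never reached.

    #include <cassert>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        // A connected pair standing in for the accepted session (fds[0]) and the
        // test's peer socket (fds[1]).
        int fds[2];
        if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
            return 1;

        // The listener side wins the race: the dropped session's destructor shuts
        // the socket down and closes it.
        ::shutdown(fds[0], SHUT_RDWR);
        ::close(fds[0]);

        // The test side now performs its initial read. A return value of 0 means
        // the peer performed an orderly shutdown; asio reports this as eof.
        char buf[16];
        ssize_t n = ::recv(fds[1], buf, sizeof(buf), 0);
        assert(n == 0);  // The early exit that skips the FailPoint.

        ::close(fds[1]);
        return 0;
    }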

Comment by Billy Donahue [ 15/May/23 ]

OK, strace -k gives stack traces...

strace -f -k -s shutdown ....

[pid 19234] shutdown(23, SHUT_RDWR)     = 0
 > libc-2.28.so(shutdown+0xb) [0xfe0bb]
 > libasio.so(asio::detail::socket_ops::shutdown(int, int, std::error_code&)+0x52) [0xbee19]
 > libtransport_layer.so(asio::detail::reactive_socket_service<asio::generic::stream_protocol>::shutdown(asio::detail::reactive_socket_service_base::base_implementation_type&, asio::socket_base::shutdown_type, std::error_code&)+0x2d) [0x58f369]
 > libtransport_layer.so(asio::basic_socket<asio::generic::stream_protocol>::shutdown(asio::socket_base::shutdown_type, std::error_code&)+0x45) [0x587277]
 > libtransport_layer.so(mongo::transport::CommonAsioSession::end()+0x92) [0x563fd8]
 > libtransport_layer.so(mongo::transport::SyncAsioSession::~SyncAsioSession()+0x29) [0x57f8cf]
 > libtransport_layer.so(mongo::transport::SyncAsioSession::~SyncAsioSession()+0x17) [0x57f8f7]
 > libtransport_layer.so(std::_Sp_counted_ptr<mongo::transport::SyncAsioSession*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x27) [0x65b809]
 > transport_test(std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x9a) [0x2f2806]
 > transport_test(std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()+0x26) [0x2e8fa6]
 > transport_test(std::__shared_ptr<mongo::transport::Session, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr()+0x1b) [0x2e443b]
 > transport_test(std::shared_ptr<mongo::transport::Session>::~shared_ptr()+0x17) [0x2e447d]
 > libtransport_layer.so(mongo::transport::AsioTransportLayer::_acceptConnection(asio::basic_socket_acceptor<asio::generic::stream_protocol>&)::{
     lambda(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)#1
     }::operator()(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)+0x4f0) [0x5f97dc]
 > libtransport_layer.so(asio::detail::move_binder2<mongo::transport::AsioTransportLayer::_acceptConnection(asio::basic_socket_acceptor<asio::generic::stream_protocol>&)::{
     lambda(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)#1
     }, std::error_code, asio::basic_stream_socket<asio::generic::stream_protocol> >::operator()()+0x50) [0x609fac]
 > libtransport_layer.so(void asio::asio_handler_invoke<asio::detail::move_binder2<mongo::transport::AsioTransportLayer::_acceptConnection(asio::basic_socket_acceptor<asio::generic::stream_protocol>&)::{
     lambda(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)#1
     }, std::error_code, asio::basic_stream_socket<asio::generic::stream_protocol> > >(asio::detail::move_binder2<mongo::transport::AsioTransportLayer::_acceptConnection(asio::basic_socket_acceptor<asio::generic::stream_protocol>&)::{
         lambda(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)#1
         }, std::error_code, asio::basic_stream_socket<asio::generic::stream_protocol> >&, ...)+0x6f) [0x60971f]

libtransport_layer.so(mongo::transport::AsioTransportLayer::_acceptConnection(asio::basic_socket_acceptor<asio::generic::stream_protocol>&)::{
     lambda(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)#1
     }::operator()(std::error_code const&, asio::basic_stream_socket<asio::generic::stream_protocol>)+0x4f0) [0x5f97dc]

The acceptCb lambda body inside _acceptConnection is unpinning a shared_ptr holding a mongo::transport::SyncAsioSession, and the destructor causes a shutdown of the socket.

https://github.com/mongodb/mongo/blob/54ce9f3a48dcde0758e4cdf77d7b0efd08d71e2b/src/mongo/transport/asio/asio_transport_layer.cpp#L1500
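
A hedged sketch of the mechanism the stack trace shows (class and function names here are illustrative): the accept callback's shared_ptr is the sole owner of the session, so dropping it runs the destructor on the listener thread, and the destructor ends the connection, which is the shutdown(..., SHUT_RDWR) visible in the strace output.

    #include <memory>
    #include <sys/socket.h>
    #include <unistd.h>

    // Stand-in for SyncAsioSession: owns a socket fd and ends the connection on
    // destruction, as CommonAsioSession::end() does in the trace above.
    class SessionLike {
    public:
        explicit SessionLike(int fd) : _fd(fd) {}
        ~SessionLike() {
            ::shutdown(_fd, SHUT_RDWR);
            ::close(_fd);
        }

    private:
        int _fd;
    };

    // Stand-in for the acceptCb lambda in _acceptConnection: if startSession()
    // declines to keep the session, the shared_ptr dies here and the socket is
    // shut down on the listener thread.
    void acceptCallbackLike(int acceptedFd) {
        auto session = std::make_shared<SessionLike>(acceptedFd);
        // When nothing else takes ownership, `session` going out of scope
        // produces the observed shutdown/close.
    }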

Comment by Billy Donahue [ 15/May/23 ]

The controversial connection is fd 21, made from thread3 to the listener thread, which accepts it as fd 22.
After some initial socket option getters and setters from both sides, we see the listener thread shut down and close fd 22.

Filtered and annotated strace log excerpt.

[listener:31139] accept(9, NULL, NULL)      = 22
//  Connection from thread3's fd=21 accepted here as fd=22
[listener:31139] epoll_ctl(11, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=3758416352, u64=140419124224480}})   = 0
[thread3:31141] epoll_wait(17, [{EPOLLOUT, {u32=3758417024, u64=140419124225152}}], 128, -1) = 1
[thread3:31141] poll([{fd=21, events=POLLOUT}], 1, 0)        = 1 ([{fd=21, revents=POLLOUT}])
// ... a few configs for different options.. all succeed. on both sides of this connection: fd 21 and 22.
[thread3:31141] ioctl(21, FIONBIO, [1])     = 0  // set to nonblocking.
// Orderly shutdown and close by listener of socket fd=22, which is the peer of thread3's fd=21 (port 45994).
[listener:31139] shutdown(22, SHUT_RDWR)    = 0
[listener:31139] close(22)                   = 0
 
[main:31134] clone(child_stack=0x7fb5dc44ba30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb5dc44c9d0, tls=0x7fb5dc44c700, child_tidptr=0x7fb5dc44c9d0) = 31143
    strace: Process 31143 attached
//  tid 31143: yet another thread created by main (not sure what it does yet); it immediately waits on a futex, while thread3 is still dealing with the shutdown of fd 21.
 
{"t":{"$date":"2023-05-11T17:25:51.402Z"},"s":"I",  "c":"TEST",     "id":0,       "ctx":"main","msg":"waiting for times entered=1"}
[main:31134] futex(0x7fffa49bd738, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1683825951, tv_nsec=502000000}, 0xffffffff
 <unfinished ...>
 
[thread3:31141] set_robust_list(0x7fb5dc44c9e0, 24) = 0
[thread3:31141] sigaltstack({ss_sp=0x7fb5dff99020, ss_flags=0, ss_size=65536}, NULL) = 0
 
// recvmsg: "The return value will be 0 when the peer has performed an orderly shutdown."
[thread3:31141] recvmsg(21, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 0
 
[thread3:31141] gettid()                    = 31143
[thread3:31141] getpid()                    = 31134
{"t":{"$date":"2023-05-11T17:25:51.403Z"},"s":"I",  "c":"NETWORK",  "id":0,       "ctx":"thread3","msg":"opportunisticRead initial sync read","attr":{"size":0,"error":"End of file"}}
{"t":{"$date":"2023-05-11T17:25:51.403Z"},"s":"I",  "c":"TEST",     "id":0,       "ctx":"thread3","msg":"session: getAsync {}","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"futurize :: caused by :: Connection closed by peer"}}}
{"t":{"$date":"2023-05-11T17:25:51.404Z"},"s":"I",  "c":"TEST",     "id":0,       "ctx":"thread3","msg":"joinThread end of callback"}
[thread3:31141] sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
[thread3:31141] madvise(0x7fb5dbc4c000, 8368128, MADV_DONTNEED) = 0
[thread3:31141] exit(0)                     = ?

I don't yet know why the listener thread shuts the socket down.
Tracing that.

Comment by Billy Donahue [ 08/May/23 ]

Confirmed that rebooting the spawnhost resets it into a state in which the repro happens again!

Comment by Billy Donahue [ 20/Apr/23 ]

ah. GDB tracing shows that the HostUnreachable is coming up in the hung test, so this IS correlated with the unexpected exceptions in the original BF. It's the result of an asio::error::eof from the first asio::read of the stream. Our errorCodeToStatus function translates the error_code asio::error::eof to HostUnreachable for some reason I don't remember; I think it was chosen because it's retryable, not because the host was unreachable.

    } else if (ec == asio::error::eof) {
        return makeStatus(ErrorCodes::HostUnreachable, "Connection closed by peer");

Anyway, an asio::error::eof is coming up through the first asio::read of the socket for some reason.
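
For illustration only, a simplified stand-alone sketch of that translation; StatusLike and errorCodeToStatusSketch are stand-ins, not the real mongo::Status or errorCodeToStatus (the numeric code 6 matches the HostUnreachable code seen in the log lines above):

    #include <asio/error.hpp>  // standalone asio, as vendored by MongoDB
    #include <string>
    #include <system_error>

    struct StatusLike {  // Stand-in for mongo::Status.
        int code;
        std::string reason;
    };

    StatusLike errorCodeToStatusSketch(const std::error_code& ec) {
        if (!ec)
            return {0, "OK"};
        if (ec == asio::error::eof)
            // Chosen because EOF from the peer is a retryable condition, not
            // because the host is literally unreachable.
            return {6 /* ErrorCodes::HostUnreachable */, "Connection closed by peer"};
        return {-1, ec.message()};
    }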

Hypothesis time:

The fact that this happens on fresh hosts for a while and then can't be reproduced for a long time is weird.

I wonder if this is related to TCP Fast Open, whose behavior depends on kernel-cached cookies outside the process.
But now there's a root cause to chase, and we can try to identify how the asio::read turned into an EOF result.

The eof has been traced to a recvmsg call (with flags=0) returning 0.

https://linux.die.net/man/2/recvmsg
"The return value will be 0 when the peer has performed an orderly shutdown."

So there's some race perhaps in this test's unusual structure causing this socket to be closed unexpectedly. More digging needed.

Comment by Billy Donahue [ 19/Apr/23 ]

https://buildbaron.corp.mongodb.com/ui/#/bfg/BFG-1827646
