[SERVER-75876] HostUnreachable in transport_test / EgressAsioNetworkingBatonTest / CancelAsyncOperationsInterruptsOngoingOperations Created: 08/Apr/23  Updated: 26/Apr/23  Resolved: 26/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File spawnhosts_comparison.mp4    
Issue Links:
Duplicate
duplicates SERVER-76103 hang in transport_test / EgressAsioNe... Closed
Related
Operating System: ALL
Sprint: Service Arch 2023-04-17, Service Arch 2023-05-01
Participants:
Linked BF Score: 7

 Description   

EgressAsioNetworkingBatonTest
CancelAsyncOperationsInterruptsOngoingOperations

throws a HostUnreachable instead of the expected CallbackCanceled.

https://buildbaron.corp.mongodb.com/ui/#/bfg/BFG-1866837
"task-timed-out: run_unittests on enterprise-rhel-81-ppc64le-dynamic"

The test failed in about 6 seconds but transport_test became hung and could not reach an exit.

https://logkeeper2.build.10gen.cc/build/d0b074972226829a1394014bb7837932/test/17520ada6fbdab1e4d5f30553cde2ec3?raw=1

[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.388Z"},"s":"I",  "c":"TEST",     "id":23059,   "ctx":"main","msg":"Running","attr":{"test":"CancelAsyncOperationsInterruptsOngoingOperations","rep":1,"reps":1}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.403Z"},"s":"I",  "c":"NETWORK",  "id":23015,   "ctx":"listener","msg":"Listening on","attr":{"address":"127.0.0.1"}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.403Z"},"s":"I",  "c":"NETWORK",  "id":23016,   "ctx":"listener","msg":"Waiting for connections","attr":{"port":41519,"ssl":"off"}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.404Z"},"s":"I",  "c":"TEST",     "id":6109502, "ctx":"thread34","msg":"connected","attr":{"port":41519}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.406Z"},"s":"W",  "c":"CONTROL",  "id":23830,   "ctx":"thread35","msg":"Set failpoint","attr":{"failPointName":"asioTransportLayerBlockBeforeOpportunisticRead","failPoint":{"mode":1,"data":{},"timesEntered":0}}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.507Z"},"s":"W",  "c":"CONTROL",  "id":23831,   "ctx":"thread35","msg":"Set failpoint","attr":{"failPointName":"asioTransportLayerBlockBeforeOpportunisticRead","failPoint":{"mode":0,"data":{},"timesEntered":1}}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.692Z"},"s":"E",  "c":"TEST",     "id":23070,   "ctx":"main","msg":"Throwing exception","attr":{"exception":"Expected ex.toStatus().code() == ErrorCodes::CallbackCanceled (HostUnreachable == CallbackCanceled) @src/mongo/transport/asio/asio_transport_layer_test.cpp:1282"}}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.693Z"},"s":"I",  "c":"TEST",     "id":6109500, "ctx":"main","msg":"connection: Tx stop request"}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.693Z"},"s":"I",  "c":"TEST",     "id":6109503, "ctx":"thread34","msg":"connection: Rx stop request"}
[cpp_unit_test:transport_test] {"t":{"$date":"2023-04-02T06:39:11.693Z"},"s":"I",  "c":"TEST",     "id":6109501, "ctx":"main","msg":"connection: joined"}

matt.diener@mongodb.com



 Comments   
Comment by Billy Donahue [ 26/Apr/23 ]

Attaching for posterity a screencap showing the difference between two spawnhosts running the same exact binary.

spawnhosts_comparison.mp4

These results are solidly repeatable.

The host on the left (red border) has been running longer. It's been used to compile a transport_test binary from source, but that's not the binary we're executing here, and that was 24 hours ago. It succeeds every time!

The host on the right (orange border) is BRAND NEW, and has never done anything other than being populated with the artifacts of BFG-1827646.
It hangs every time!

Comment by Billy Donahue [ 26/Apr/23 ]


The hang of SERVER-76103 and the unexpected HostUnreachable exception here both have the same underlying cause: EOF of a socket that's being read. Consolidating these tickets.

Comment by Billy Donahue [ 26/Apr/23 ]

Ok I had a reliable repro for several days. Rock solid.

Then I wanted to try out a code change and rebuilt transport_test from the repo snapshot that the spawnhost imported, so it's the same exact code plus a small change of a connect timeout. After that rebuild, the rebuilt test executable could not repro the issue.
And once again, even the original executable from the BFG could not repro it either.

So the crazy scenario from the previous comment seems to be repeatable on at least 2 spawnhosts.
It seems that compiling the test executable changes the host in such a way that it can't repro the issue.

THIS IS EXTREMELY ODD.

I will have to try again to reason through this BF from staring at the source. SOMETHING must be wrong somewhere.

Comment by Billy Donahue [ 12/Apr/23 ]

Interesting.

Several hours ago this was a 100% repro on a spawnhost generated from the evergreen page for a failing BFG-1827646 task..

./transport_test \
    --suite=EgressAsioNetworkingBatonTest \
    --filter=AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton

I wanted to build my own transport_test ON THE spawnhost and repro this with a custom built (but theoretically identical) test binary.
That would provide me with a starting point to add instrumentation.

But by the time I got this built and ready to run, it was passing 100%. This is an interesting clue, but unfortuntaly now the original repro with the original task's binaries was ALSO now passing 100%, after formerly being a reliable repro.

So I am forced back into looking at core dumps and trying perhaps on yet another spawnhost to repro this. Maybe something about building my test binary modified the machine in such a way that the repro stopped being a repro? Really confusing.

Comment by Billy Donahue [ 10/Apr/23 ]

Better BFG to study this problem...

BFG-1827646
https://buildbaron.corp.mongodb.com/ui/#/bfg/BFG-1827646

task-timed-out: run_unittests on enterprise-rhel80-unoptimized-64-bit [mongodb-mongo-master @ 513497c6] (transport_test)

Again 5 seconds to fail.

EgressAsioNetworkingBatonTest

.CancelAsyncOperationsInterruptsOngoingOperations

dies with host unreachable unexpected exception.

.AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton then hangs.

https://parsley.mongodb.com/resmoke/81dcf3d0f11ac302af659d970c7ea216/test/17479846d765d5364d5f303bf7093400?bookmarks=0,181,186,191&filters=100%28failed%20to%20load%7Cinvariant%7Cfassert%7Cinvalid%20access%7C%22BACKTRACE%22%7C%5Cd%20%20threads%3F%20with%20tid%7CcheckReplicatedDataHa%7Cthe%20following%7CTripwire%20assertion%7Cthrew%20exception%7CTraceback%29&shareLine=0

can't repro on virtual workstation, but can repro on a spawnhost recreating the rhel80 environment.

Repro on spawnhost does NOT require the previous test to have failed.
This invocation runs ONE test in the suite and hangs on it.

$ ./transport_test \
    --suite=EgressAsioNetworkingBatonTest \
    --filter=AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton

I believe the HostUnreachable exception was a red herring.

Comment by Billy Donahue [ 10/Apr/23 ]

BFG-1866837

task-timed-out: run_unittests on enterprise-rhel-81-ppc64le-dynamic [mongodb-mongo-master-nightly @ 128fe164] (transport_test)

EgressAsioNetworkingBatonTest.
AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton

hang in tranport_test leading to 2 hour timeout of task.

https://parsley.mongodb.com/evergreen/mongodb_mongo_master_nightly_enterprise_rhel_81_ppc64le_dynamic_run_unittests_128fe164fb93ef1233158019e59f43570f6e0df0_23_04_01_23_02_49/0/task?bookmarks=0,699,908,1050,1860

Happens after failing with a HostUnreachable from an earlier test:

EgressAsioNetworkingBatonTest.
CancelAsyncOperationsInterruptsOngoingOperations

So CancelAsyncOperationsInterruptsOngoingOperations failed first with the exception, and then the test continued. The next test was AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton which hanged.

This is on ppc64le arch which is hard to work with. Looking for a more accessible BFG to repro.

Generated at Thu Feb 08 06:31:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.