[SERVER-75876] HostUnreachable in transport_test / EgressAsioNetworkingBatonTest / CancelAsyncOperationsInterruptsOngoingOperations Created: 08/Apr/23 Updated: 26/Apr/23 Resolved: 26/Apr/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Billy Donahue | Assignee: | Billy Donahue |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Sprint: | Service Arch 2023-04-17, Service Arch 2023-05-01 |
| Participants: | |
| Linked BF Score: | 7 |
| Description |
|
EgressAsioNetworkingBatonTest throws a HostUnreachable instead of the expected CallbackCanceled. https://buildbaron.corp.mongodb.com/ui/#/bfg/BFG-1866837 The test failed in about 6 seconds, but transport_test then hung and never exited.
|
| Comments |
| Comment by Billy Donahue [ 26/Apr/23 ] | |||
|
Attaching for posterity a screencap showing the difference between two spawnhosts running the exact same binary. These results are solidly repeatable. The host on the left (red border) has been running longer. It was used to compile a transport_test binary from source, but that was 24 hours ago and that's not the binary we're executing here. It succeeds every time! The host on the right (orange border) is BRAND NEW, and has never done anything other than being populated with the artifacts of BFG-1827646.
| Comment by Billy Donahue [ 26/Apr/23 ] | |||
|
| |||
| Comment by Billy Donahue [ 26/Apr/23 ] | |||
|
OK, I had a reliable repro for several days. Rock solid. Then I wanted to try out a code change, so I rebuilt transport_test from the repo snapshot that the spawnhost imported: the exact same code plus a small change to a connect timeout. After that rebuild, the rebuilt test executable could not repro the issue. So the crazy scenario from the previous comment seems to be repeatable on at least 2 spawnhosts. THIS IS EXTREMELY ODD. I will have to try again to reason through this BF by staring at the source. SOMETHING must be wrong somewhere.
| Comment by Billy Donahue [ 12/Apr/23 ] | |||
|
Interesting. Several hours ago this was a 100% repro on a spawnhost generated from the evergreen page for a failing BFG-1827646 task.
I wanted to build my own transport_test ON THE spawnhost and repro this with a custom-built (but theoretically identical) test binary. But by the time I got this built and ready to run, it was passing 100%. This is an interesting clue, but unfortunately the original repro with the original task's binaries is now ALSO passing 100%, after formerly being a reliable repro. So I am forced back into looking at core dumps, and perhaps trying to repro this on yet another spawnhost. Maybe something about building my test binary modified the machine in such a way that the repro stopped being a repro? Really confusing.
| Comment by Billy Donahue [ 10/Apr/23 ] | |||
|
A better BFG for studying this problem:
BFG-1827646 task-timed-out: run_unittests on enterprise-rhel80-unoptimized-64-bit [mongodb-mongo-master @ 513497c6] (transport_test)
Again, 5 seconds to fail. EgressAsioNetworkingBatonTest.CancelAsyncOperationsInterruptsOngoingOperations dies with an unexpected HostUnreachable exception, and then .AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton hangs. I can't repro on a virtual workstation, but I can repro on a spawnhost recreating the rhel80 environment. The repro on the spawnhost does NOT require the previous test to have failed.
I believe the HostUnreachable exception was a red herring.
| Comment by Billy Donahue [ 10/Apr/23 ] | |||
|
task-timed-out: run_unittests on enterprise-rhel-81-ppc64le-dynamic [mongodb-mongo-master-nightly @ 128fe164] (transport_test)
EgressAsioNetworkingBatonTest: a hang in transport_test leads to a 2-hour timeout of the task. It happens after an earlier EgressAsioNetworkingBatonTest test fails with a HostUnreachable. So CancelAsyncOperationsInterruptsOngoingOperations failed first with the exception, and then the test run continued. The next test was AsyncOpsMakeProgressWhenSessionAddedToDetachedBaton, which hung. This is on the ppc64le arch, which is hard to work with. Looking for a more accessible BFG to repro.