[SERVER-73944] Log an explicit, searchable message that test harness is killing the test Created: 02/Feb/23  Updated: 29/Oct/23  Resolved: 23/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Improvement Priority: Minor - P4
Reporter: Kevin Cherkauer Assignee: Mikhail Shchatko
Resolution: Fixed Votes: 0
Labels: FY24Q1-request
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Development Platform
Backwards Compatibility: Fully Compatible
Sprint: DAG 2023-10-16, DAG 2023-10-30
Participants:
Story Points: 2

 Description   

Can the test harness please log an explicit and informative message that it is about to kill the test processes before doing so? The current behavior is not easily recognizable as an intentional kill by the test harness and can easily be mistaken for a crash of the server.

Example from BF-27450:

Writing fatal message
message:
ExternalRecordStoreTest
NamedPipeMultiplePipes4
Writing fatal message
message: Got signal: 6 (Abort trap: 6).
mongo::stack_trace_detail::(anonymous namespace)::getStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&)
mongo::printStackTrace()
abruptQuit
_sigtramp
__srefill1
__fread
fread
std::__1::basic_filebuf<char, std::__1::char_traits<char> >::underflow()
std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow()
std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsgetn(char*, long)
std::__1::basic_istream<char, std::__1::char_traits<char> >::read(char*, long)
mongo::NamedPipeInput::doRead(char*, int)
mongo::InputStream<mongo::NamedPipeInput>::readBytes(int, char*)
mongo::MultiBsonStreamCursor::nextFromCurrentStream()
mongo::MultiBsonStreamCursor::next()
mongo::UnitTest_SuiteNameExternalRecordStoreTestTestNameNamedPipeMultiplePipes4::_doTest()
mongo::unittest::Test::run()

I am told that I am not the only person who has been fooled by this. To me it looked like the test timed out because the server had crashed, dumped a stack trace, and therefore stopped making progress; in reality, the server had stopped making progress an hour earlier, and then the test harness sent "kill -6" to abort the test.

A message something like the following would help avoid time wasted investigating the wrong thing, and would also make it easier for humans using Parsley to find the failure (currently, timed-out tests do not have the word "timeout" anywhere in their logs):

TEST TIMEOUT FAILURE ABORT: Aborting the test via "kill -6" because it has not made progress for one hour. 
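A minimal sketch of the requested behavior, in the style of the Python buildscripts (the function name and logger setup below are hypothetical, not resmoke's actual API):

import logging
import os
import signal

LOGGER = logging.getLogger("resmoke")

def abort_hung_test(pid, idle_secs):
    # Log a loud, searchable line BEFORE sending the signal, so a reader can
    # attribute the SIGABRT to the harness rather than to a server crash.
    LOGGER.error(
        'TEST TIMEOUT FAILURE ABORT: Aborting the test via "kill -6" because '
        "it has not made progress for %d seconds.",
        idle_secs)
    os.kill(pid, signal.SIGABRT)  # the equivalent of `kill -6 <pid>`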

 



 Comments   
Comment by Githook User [ 20/Oct/23 ]

Author:

{'name': 'Mikhail Shchatko', 'email': 'mikhail.shchatko@mongodb.com', 'username': 'MikhailShchatko'}

Message: SERVER-73944 Log that processes are killed by hang analyzer in test logs
Branch: master
https://github.com/mongodb/mongo/commit/79985637361f57882ae49f448465bf9737fc6a17

Comment by Trevor Guidry [ 19/Oct/23 ]

mikhail.shchatko@mongodb.com and I synced on Slack and agreed on a solution: we think we can add a timeout message when the timeout happens.

Comment by Kevin Cherkauer [ 18/Oct/23 ]

steve.gross@mongodb.com The test failure is already marked as "timeout." The goal is to be able to distinguish whether the apparent crash in the logs was a real server crash or just the test harness killing the server because it got tired of waiting for the test to finish, i.e.

  1. Did the server crash of its own accord, and eventually when the test did not finish after a set period the test harness then marked it as timed out?
  2. Or did the test just keep running and running, and after a set period the test harness killed the server and marked it as timed out?

#1 is a server crash and #2 is either a server hang or a bad performance regression. In the case of #1 the test harness probably also attempts to kill the server, but this second signal won't be logged by the server since the server was already dead from the real crash. Thus both cases show only one signal in the server log, making these cases hard to distinguish today.

Comment by Mikhail Shchatko [ 18/Oct/23 ]

trevor.guidry@mongodb.com I think that anywhere in report.py, log messages would be added before or after the test, but not in the middle of the test at the exact time the hang analyzer is running. I think having this message after the hang analyzer has already finished killing processes would have very little value and would not be worth adding. The request here is to have a message right before processes are killed, so that it is easier to see from the test logs that a server crash did not happen before the Evergreen timeout was hit.

Having a timeout test status is probably helpful, but the problem above still remains.

Comment by Trevor Guidry [ 18/Oct/23 ]

mikhail.shchatko@mongodb.com I think all of that is correct.

 

When the hang analyzer is called, it writes all of the reports to a file (this reports file is what Evergreen uses): https://github.com/10gen/mongo/blob/1625d69ac69dfbb419e1114601a2c0e7ec64f9a5/buildscripts/resmokelib/sighandler.py#L64
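As a rough sketch of that step (the JSON shape below is illustrative only, not the exact schema sighandler.py produces):

import json

def write_report_file(results, path="report.json"):
    # Dump the accumulated per-test results to a file that Evergreen later
    # reads to decide how to display the task's test statuses.
    with open(path, "w") as fh:
        json.dump({"results": results}, fh)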

 

It looks like we already have logic for detecting timeout status here.

We could maybe add a log line, but I think a better solution is setting the test status in Evergreen to "timeout" instead of "failure". From the comment in the above code snippet, it looks like that was the original intention, but it is blocked on the Evergreen ticket below. That ticket was closed as Won't Fix in a bulk close about a year ago, but I just reopened it.

https://jira.mongodb.org/browse/EVG-1536

If we want to add a log line to address the timeouts, I think we could store test_logger in test_info here. In that same "timeout detection" section, we could then write to the stored test_logger from test_info, somewhere in here (a sketch follows).
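A hedged sketch of that suggestion (the names flag_timed_out_tests, test_info.status, and test_info.test_logger are assumptions for illustration, not the exact fields in report.py):

def flag_timed_out_tests(test_infos):
    # In the timeout-detection section, walk the recorded tests and write a
    # searchable message to each timed-out test's own logger, so it appears
    # in the Evergreen test logs rather than only the task logs.
    for test_info in test_infos:
        if test_info.status == "timeout":
            test_info.test_logger.error(
                "TEST TIMEOUT: this test's processes were killed by the "
                "harness because the Evergreen timeout was hit.")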

 

I don't think this is super high priority, so I wouldn't mind waiting on the Evergreen ticket, but it was filed in 2017, so maybe we should not get our hopes up.

The task is marked as a timeout, which should be a good hint to developers that something was cut off, but I understand it is confusing when the test itself has no distinct status to separate a timeout from a regular failure.

Comment by Mikhail Shchatko [ 18/Oct/23 ]

If I'm not missing anything, this is what happens now when the Evergreen timeout is hit (see the sketch after the list):
1. The hang analyzer is called by Evergreen. It has its own logger, whose output is posted to the Evergreen task logs.
2. The hang analyzer signals the resmoke.py process that runs the test, which triggers resmoke's signal handler. Resmoke has its own logger, and its output is printed in the Evergreen task logs. The signal handler is registered once, before the resmoke suite runs, and uses resmoke's logger, so its output goes to the Evergreen task logs as well. Before each test, resmoke creates a new test logger so that the test's output appears in that test's individual Evergreen test logs.
3. Resmoke's signal handler runs another hang analyzer, so that is yet another hang analyzer logger, and its output is printed in the Evergreen task logs.
4. This latest hang analyzer in turn kills or aborts the processes spawned by resmoke.
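A simplified sketch of steps 2-4 (the registration function and handler body are illustrative; resmoke's real handler lives in buildscripts/resmokelib/sighandler.py):

import signal

def register_signal_handler(task_logger):
    # Registered once, before the suite runs, so the handler can only see the
    # suite-level (task) logger -- never the per-test loggers created later.
    def _handle(signum, frame):
        task_logger.info("Received signal %d; running hang analyzer...", signum)
        # The handler then spawns another hang analyzer, which also logs to
        # the task logs and finally kills/aborts resmoke's child processes.

    signal.signal(signal.SIGUSR1, _handle)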

So here is the problem. We have a decent amount of messages about the timeout and the hang analyzer's actions in the Evergreen task logs. Examples:

But we don't have anything in the Evergreen test logs. It would be very helpful to have something there, but there is no straightforward way of doing so, or at least I don't see one.

We create the test logger here: https://github.com/10gen/mongo/blob/73a62e2f1642ba1dc10f36f62d17b14803243935/buildscripts/resmokelib/testing/report.py#L128-L131
As far as I understand, this test logger is available only in these two functions: https://github.com/10gen/mongo/blob/73a62e2f1642ba1dc10f36f62d17b14803243935/buildscripts/resmokelib/testing/report.py#L108-L176
And here resmoke invokes the hang analyzer that will kill the processes: https://github.com/10gen/mongo/blob/1402dc3fd226bf7073b14ea1e0b8cdd7e62f8978/buildscripts/resmokelib/sighandler.py#L156C1-L158
We can easily add log messages before and after tests, but not at the moment the hang analyzer kills processes, since the hang analyzer is not invoked from the test (see the sketch below).
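One conceivable way to bridge that gap, sketched under the assumption that a module-level handle shared between report.py and sighandler.py is acceptable (this is not resmoke's actual code; all names below are hypothetical):

# Module-level handle to the currently running test's logger.
_CURRENT_TEST_LOGGER = None

def set_current_test_logger(logger):
    # Would be called from report.py when a test starts (and reset to None
    # when the test ends).
    global _CURRENT_TEST_LOGGER
    _CURRENT_TEST_LOGGER = logger

def log_kill_to_test_logs():
    # Would be called from the signal handler right before the hang analyzer
    # kills processes, so the message lands in the individual test's logs.
    if _CURRENT_TEST_LOGGER is not None:
        _CURRENT_TEST_LOGGER.error(
            "Hang analyzer is killing this test's processes because the "
            "Evergreen timeout was hit; any abort signals that follow are "
            "intentional.")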

trevor.guidry@mongodb.com steve.gross@mongodb.com alex.neben@mongodb.com thoughts, suggestions?
