[SERVER-73944] Log an explicit, searchable message that test harness is killing the test Created: 02/Feb/23 Updated: 29/Oct/23 Resolved: 23/Oct/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.2.0-rc0 |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Kevin Cherkauer | Assignee: | Mikhail Shchatko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | FY24Q1-request | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Server Development Platform
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | DAG 2023-10-16, DAG 2023-10-30 |
| Participants: | |
| Story Points: | 2 |
| Description |
|
Can the test harness please log an explicit and informative message that it is about to kill the test processes before doing so. The current behavior is not easily recognizable as being an intentional kill from the test harness and can easily be mistaken for a crash of the server. Example from BF-27450:
I am told that I am not the only person who has been fooled by this. This looked to me like the reason the test timed out was because the server had crashed and stack dumped and therefore stopped making progress, but the reality was that the server had stopped making progress an hour ago and then the test harness sent "kill -6" to abort the test. A message something like the following would be helpful to avoid time wasted investigating the wrong thing, and also make it easier for humans using Parsley to find the failure (currently timed out tests do not have the word "timeout" anywhere in the logs):
|
| Comments |
| Comment by Githook User [ 20/Oct/23 ] |
|
Author: {'name': 'Mikhail Shchatko', 'email': 'mikhail.shchatko@mongodb.com', 'username': 'MikhailShchatko'}Message: |
| Comment by Trevor Guidry [ 19/Oct/23 ] |
|
mikhail.shchatko@mongodb.com and I synced on slack and agreed on a solution where we think we can add a timeout message when the timeout happens. |
| Comment by Kevin Cherkauer [ 18/Oct/23 ] |
|
steve.gross@mongodb.com The test failure is already marked as "timeout." The goal is to be able to distinguish whether the apparent crash in the logs was a real server crash or just the test harness killing the server because it got tired of waiting for the test to finish, i.e.
#1 is a server crash and #2 is either a server hang or a bad performance regression. In the case of #1 the test harness probably also attempts to kill the server, but this second signal won't be logged by the server since the server was already dead from the real crash. Thus both cases show only one signal in the server log, making these cases hard to distinguish today. |
| Comment by Mikhail Shchatko [ 18/Oct/23 ] |
|
trevor.guidry@mongodb.com I think anywhere in report.py log messages will be added before or after the test, but not in the middle of the test and exact time when hang analyzer is running. I think having this message after hang analyzer has already done killing processes will have very little value and not worth adding. The request here is to have a message right before processes are killed, so that it would be easier to see in test logs that server crash didn't happen before evergreen timeout was hit. Having timeout test status is probably helpful, but still the problem above remains. |
| Comment by Trevor Guidry [ 18/Oct/23 ] |
|
mikhail.shchatko@mongodb.com I think all of that is correct.
When the hang analyzer is called it writes all of the reports to a file (this reports file is what evergreen uses) https://github.com/10gen/mongo/blob/1625d69ac69dfbb419e1114601a2c0e7ec64f9a5/buildscripts/resmokelib/sighandler.py#L64
It looks like we already have logic for detecting timeout status here. We could maybe add a log line but I think a better solution is by setting the test status in evergreen to "timeout" instead of "failure". From the comment in the above code snippet it looks like that was the original intention but it is blocked on this ticket. This ticket was closed as won't fix with a bulk closed as of a year ago but I just reopened it. https://jira.mongodb.org/browse/EVG-1536 If we want to add a log line to address the timeouts I think we could store test_logger in test_info here. In that same "timeout detection" section we could write to the stored test_logger from test_info somewhere in here.
I don't think this is super high priority so I wouldn't mind waiting on the evergreen ticket but it was made in 2017 so maybe we should not get our hopes up. The task is marked as a timeout which should be a good hint to developers that stuff was cut off but I understand that it is confusing when there is not different status on the test between a regular failure. |
| Comment by Mikhail Shchatko [ 18/Oct/23 ] |
|
If I'm not missing anything, this is what happens now when Evergreen timeout is hit: So here is the problem. We have decent amount of messages about timeout and hang analyzer actions in Evergreen task logs. Examples:
But we don't have anything in Evergreen test logs. It would be very helpful to have something there, but there is no straight-forward way of doing so, or I don't see any. We create test logger here: https://github.com/10gen/mongo/blob/73a62e2f1642ba1dc10f36f62d17b14803243935/buildscripts/resmokelib/testing/report.py#L128-L131 trevor.guidry@mongodb.com steve.gross@mongodb.com alex.neben@mongodb.com thoughts, suggestions? |