[SERVER-19895] resmoke failures should self-document Created: 12/Aug/15  Updated: 10/May/22  Resolved: 22/Nov/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Eric Milkie Assignee: Raiden Worley (Inactive)
Resolution: Done Votes: 0
Labels: tig-qwin-eligible, tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: TIG E (01/08/16), STM 2019-11-14, STM 2019-11-28
Participants:
Story Points: 2

 Description   

When resmoke fails, it should print out steps to help the user debug the failure. For example, when resmoke detects that it is running in Evergreen, it should print out where the user should look for symptoms.
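A minimal sketch of what this could look like in resmoke's Python. The EVERGREEN_TASK_ID environment-variable check and the hint text are assumptions for illustration, not resmoke's actual detection logic:

import os
import sys

def print_failure_help():
    """Print debugging hints when resmoke exits with a failure.

    Detecting Evergreen via EVERGREEN_TASK_ID is an assumption; the real
    integration would use whatever signal resmoke already has for
    "running under Evergreen".
    """
    in_evergreen = "EVERGREEN_TASK_ID" in os.environ  # hypothetical detection
    hints = ["resmoke exited with a failure. Where to look:"]
    if in_evergreen:
        hints += [
            "  * The red boxes in the Evergreen sidebar: check fixture",
            "    setup/teardown entries first, then hooks, then tests.",
            "  * The 'test logs' linked from each red box contain the",
            "    detailed error; the task log only summarizes.",
        ]
    else:
        hints.append("  * The per-job fixture logs, for processes that exited non-zero.")
    print("\n".join(hints), file=sys.stderr)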

Original description:
At shutdown time, a fixture (Replication or Sharding) checks the return status of all the process shutdown procedures. If any of them returned 'false' (meaning the process exited with a non-zero status), it fails the test suite associated with that fixture.
It would be helpful if the fixture wrote a message to the log stating which process caused the suite failure. Currently, the only way to diagnose this is to scour the logs for the exit status of each process; since we are looking for a line that is not "exited with code 0.", this is not a simple search to undertake.
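A minimal sketch of the requested behavior. The process interface (a name, a port, and a stop() method returning the exit code) is a simplification assumed for illustration, not the real fixture API:

import logging

logger = logging.getLogger("fixture")

def teardown_fixture(processes):
    """Stop every fixture process and name the ones that exited non-zero."""
    failed = []
    for proc in processes:
        exit_code = proc.stop()
        if exit_code != 0:
            failed.append((proc.name, proc.port, exit_code))

    for name, port, code in failed:
        logger.error("%s on port %d exited with code %d during teardown.",
                     name, port, code)
    return not failed  # False still fails the suite, but the culprit is now logged.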



 Comments   
Comment by Githook User [ 22/Nov/19 ]

Author:

{'name': 'Carl Worley', 'email': 'carl.worley@mongodb.com'}

Message: SERVER-19895 resmoke failures should self-document
Branch: master
https://github.com/mongodb/mongo/commit/6c803a670431185485786b1ac84df08cd50bca25

Comment by Raiden Worley (Inactive) [ 22/Nov/19 ]

CR: https://mongodbcr.appspot.com/513670005/

Comment by Robert Guo (Inactive) [ 12/Sep/19 ]

Excellent!

I updated the ticket to say that resmoke's error handling and reporting mechanism should gather and display debugging instructions in addition to the sometimes obscure error messages.

Comment by Eric Milkie [ 12/Sep/19 ]

Maybe the confusion is that when there are multiple red boxes, it isn't clear that one should look for teardown/setup failures first, then hook failures, then test failures? The teardown/setup failure box should replace the need to look for "error" in the task logs.

Yes, exactly! Perhaps all we need to do is document this better – this is all I need to keep in mind to help me diagnose these in the future.

Comment by Robert Guo (Inactive) [ 12/Sep/19 ]

I end up looking at a combination of all the "failed" test logs

The original request in this ticket was to make it easier to detect if a process exited abnormally. The solution was to add an item (jobx_fixture_teardown) to the sidebar that will turn red if a fixture crashed. This extra red box is meant to replace searching for lines that are not "exited with code 0", which I believe it now does, with a much more explicit error message: "An error occurred during the teardown..."

Maybe the confusion is that when there are multiple red boxes, it isn't clear that one should look for teardown/setup failures first, then hook failures, then test failures? The teardown/setup failure box should replace the need to look for "error" in the task logs.

Happy to chat more if the above comment doesn't help.

Comment by Eric Milkie [ 11/Sep/19 ]

But that isn't true, actually. After clicking on the links for the red boxes, I must then try to figure out what went wrong. The way that Evergreen divides up logs into tests is subtle, so I end up looking at a combination of all the "failed" test logs, and the task log, to try to figure out what the actual problem is. This involves trying to parse the task log, and then I end up at the point where my suggestion starts.

As an example, take the failure that Vlad linked above. There are three "tests" that are flagged as failed, but only one of them truly shows you logs that identify the problem.

Comment by Robert Guo (Inactive) [ 10/Sep/19 ]

Thanks for the suggestion Eric. The roundabout but utilitarian answer to both of your questions is that you don't have to worry about any of it now. All failures/errors/whatevers are expressed as red boxes in the Evergreen sidebar. The thing to remember now is just to click on the link(s) for the red box(es).

There are some historical reasons for the error/failure mishmash and for printing the report summary three times. I will convert this ticket to address them.

Comment by Eric Milkie [ 10/Sep/19 ]

Cool, thanks. I think we should convert this ticket to a little cleanup in the task output for failures such as this one.
My issues:
1. The failure output is printed 3 times - is that really necessary?
2. The failure output says this (this is one of the three copies of this output):

 [2019/09/09 15:35:01.221] 524 test(s) ran in 963.35 seconds (521 succeeded, 67 were skipped, 2 failed, 1 errored)
 [2019/09/09 15:35:01.221]     fsm_workload_tests: 524 test(s) ran in 963.31 seconds (521 succeeded, 67 were skipped, 2 failed, 1 errored)
 [2019/09/09 15:35:01.221]         The following tests failed (with exit code):
 [2019/09/09 15:35:01.221]             create_database:CheckReplDBHashInBackground (1 DB Exception)
 [2019/09/09 15:35:01.221]             jstests/concurrency/fsm_workloads/create_database.js (2 UNKNOWN)
 [2019/09/09 15:35:01.221]         The following tests had errors:
 [2019/09/09 15:35:01.221]             job0_fixture_teardown

Can we make it so that no one has to remember what the difference between an "errored" test and a "failed" test is? (Is the difference that "failed" tests returned a bad status from running a program? If that's true, how can "1 DB Exception" be an exit code from such a test?) Even more confusingly, the output for job0_fixture_teardown in the task log literally says

 [2019/09/09 15:35:01.216] [executor:fsm_workload_test:job0] 2019-09-09T15:35:01.216+0000 job0_fixture_teardown ran in 1.52 seconds: failed.
 [2019/09/09 15:35:01.217] [executor:fsm_workload_test:job0] 2019-09-09T15:35:01.216+0000 The teardown of ReplicaSetFixture (Job #0) failed.

So did it fail or did it error? Ideally, we'd remove this distinction to reduce confusion.
I think keeping the reason for failure/error in the test logs is fine and we don't need to elevate that out into the task logs, as long as there are always test logs to look at. Are there ever cases where the task log will say a test failed or errored and there are no corresponding test logs to look at?
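For readers hitting the same confusion, a sketch of the convention the summary appears to follow (the class and method names below are illustrative, not resmoke's): a "failure" comes from the test itself, while an "error" comes from the machinery around it.

class SummaryCounts:
    """Illustrative unittest-style tally: 'failed' means the test ran and
    returned a bad status; 'errored' means something outside the test body
    (e.g. a fixture teardown) raised an exception."""

    def __init__(self):
        self.failed = []   # (test_name, exit_code)
        self.errored = []  # (component_name, exception)

    def record_test(self, name, exit_code):
        if exit_code != 0:
            self.failed.append((name, exit_code))

    def record_infrastructure_error(self, name, exc):
        self.errored.append((name, exc))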

Comment by Vlad Rachev (Inactive) [ 10/Sep/19 ]

Yes here's one: https://evergreen.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_62_64_bit_concurrency_replication_02dbec04bad4a6b138a204a29644a0524e5c3123_19_09_09_14_32_15

In the test logs:

[fsm_workload_test:job0_fixture_teardown] 2019-09-09T15:35:01.213+0000 An error occurred during the teardown of ReplicaSetFixture (Job #0): Error while stopping replica set member on port 20002: mongod on port 20002 was expected to be running, but wasn't. Process exited with code 14. - Error while stopping replica set member on port 20001: mongod on port 20001 was expected to be running, but wasn't. Process exited with code 14.

Comment by Eric Milkie [ 10/Sep/19 ]

We waited too long after Max posted his example back in January, so the log data has been purged from logkeeper. Do you have a more recent example of this?

Comment by Vlad Rachev (Inactive) [ 09/Sep/19 ]

milkie A detailed log message is written to the "test logs". That seems sufficient to me - would you also like that same log message to be written to the "task logs", as max.hirschhorn suggested? If not, I will close the ticket.

Comment by Max Hirschhorn [ 23/Jan/19 ]

I believe the goal of this ticket should be to make the

[2019/01/23 20:01:11.595] [executor:js_test:job6] 2019-01-23T20:01:11.595+0000 The teardown of ReplicaSetFixture (Job #6) failed.

log message that's written to the "task logs" include the same level of detail as the

[js_test:job6_fixture_teardown] 2019-01-23T20:01:11.594+0000 An error occurred during the teardown of ReplicaSetFixture (Job #6): Error while stopping replica set member on port 21501: mongod on port 21501 was expected to be running, but wasn't. Process exited with code -11.

log message that's written to the "test logs".

https://evergreen.mongodb.com/task/mongodb_mongo_master_linux_64_debug_replica_sets_terminate_primary_jscore_passthrough_ffb19c17927a5a0a07a7cd7cd7744cea80ab3e26_19_01_23_19_38_08
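A minimal sketch of the change being suggested, assuming the fixture's teardown raises an exception carrying the same detailed message already written to the "test logs" (the function and logger names are illustrative):

import logging

task_logger = logging.getLogger("executor")  # the "task logs"

def run_fixture_teardown(fixture, job_num):
    """Tear down the fixture and repeat the detailed reason at task-log level."""
    try:
        fixture.teardown()
    except Exception as err:
        # Previously only "The teardown of ... failed." reached the task log;
        # including str(err) surfaces the per-process exit codes as well.
        task_logger.error("The teardown of %s (Job #%d) failed: %s",
                          type(fixture).__name__, job_num, err)
        raise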

Comment by Michael O'Brien [ 05/Jan/16 ]

Writing causes of failure to stderr might also help when running in Evergreen, since log lines from stderr show up in red and are easily noticed.
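A sketch of one way to do that with the standard logging module (the handler wiring is an assumption, not resmoke's current setup): route ERROR-and-above records to stderr so Evergreen renders them in red, and keep everything else on stdout.

import logging
import sys

logger = logging.getLogger("resmoke")
logger.setLevel(logging.INFO)

stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.addFilter(lambda record: record.levelno < logging.ERROR)

stderr_handler = logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.ERROR)  # failure causes land on stderr -> red in Evergreen

logger.addHandler(stdout_handler)
logger.addHandler(stderr_handler)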

Comment by Michael O'Brien [ 05/Jan/16 ]

I would also like to make job teardown messages stand out more when they fail; for example, instead of:

[executor:js_test] 2016-01-04T23:51:30.774+0000 Teardown of MasterSlaveFixture (Job #8) was not successful.

something like:

[executor:js_test] 2016-01-04T23:51:30.774+0000 Teardown of MasterSlaveFixture (Job #8) FAILED.
