[SERVER-57755] More aggressively fail tests where a node unexpectedly crashes instead of timing out Created: 16/Jun/21  Updated: 05/Feb/24  Resolved: 05/Feb/24

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: [DO NOT ASSIGN] Backlog - DevProd Correctness
Resolution: Won't Do
Votes: 1
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Correctness
Participants:
Linked BF Score: 165

 Description   

I assume this depends mostly on which kinds of tests are being run, but it's relatively common for a node crash to cause the test to stall and time out instead of failing.

Major benefits:

  • Less time spent running tests. A test that normally takes seconds or minutes can hang after a crash and only time out hours later.
  • The large logs that result from timeouts can be quite a challenge to inspect. We have improvements coming through for that, but it seems worthwhile to avoid recording GBs of log data when typically everything one needs to know is in the first ~100MB.
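
As a rough illustration of the fail-fast behaviour requested here, the sketch below runs a test body in a background thread while polling the node processes, and raises as soon as any node exits. This is not how resmoke works or is expected to work; `run_test_with_crash_watch`, `NodeCrashed`, and the polling interval are hypothetical names and values chosen for the example, and the nodes are assumed to be plain `subprocess.Popen` handles.

```python
import subprocess
import sys
import threading
import time


class NodeCrashed(Exception):
    """Hypothetical error: a monitored node exited before the test finished."""


def run_test_with_crash_watch(test_fn, nodes, timeout_s=3600.0, poll_s=0.1):
    """Run test_fn in a thread; fail fast if any node process exits early.

    Assumption: `nodes` is a list of subprocess.Popen objects.
    """
    done = threading.Event()
    errors = []

    def target():
        try:
            test_fn()
        except Exception as exc:  # surface test failures to the caller
            errors.append(exc)
        finally:
            done.set()

    # Daemon thread so a hung test body cannot keep the runner alive.
    threading.Thread(target=target, daemon=True).start()
    deadline = time.monotonic() + timeout_s

    # done.wait() returns False on each poll interval until the test ends.
    while not done.wait(poll_s):
        for node in nodes:
            code = node.poll()  # None while the process is still running
            if code is not None:
                raise NodeCrashed(f"node pid {node.pid} exited with code {code}")
        if time.monotonic() > deadline:
            raise TimeoutError("test timed out")

    if errors:
        raise errors[0]


if __name__ == "__main__":
    # Stand-in "node" that crashes shortly after startup.
    crashing_node = subprocess.Popen(
        [sys.executable, "-c", "import sys, time; time.sleep(0.2); sys.exit(3)"]
    )
    try:
        # The test body would otherwise block for a long time.
        run_test_with_crash_watch(lambda: time.sleep(30), [crashing_node])
    except NodeCrashed as exc:
        print(f"failed fast: {exc}")
```

With a watcher like this, the failure is reported within roughly one polling interval of the crash rather than at the suite-level timeout, which also bounds how much log output accumulates after the point of interest.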

Apologies if this ticket already exists – I felt I had seen one before. I was struggling to come up with the right keywords to find it.



 Comments   
Comment by Thomas Langston [ 05/Feb/24 ]

It appears this issue has been resolved by other changes in another Epic. Please open a new issue if this issue reappears.

Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.
