[SERVER-38060] Don't run after test hooks in resmoke if the test fails Created: 09/Nov/18  Updated: 27/Oct/23  Resolved: 05/Feb/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Gone away Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-38059 Transactions write conflicts tests sh... Closed
is related to SERVER-36770 Provide a way to manually clean up pr... Closed
is related to SERVER-37103 Add a hook to check for open transact... Closed
Assigned Teams:
Server Tooling & Methods
Participants:

 Description   

We should not run the resmoke correctness hooks if a test fails and the test suite runs with --continueOnFailure. In the best case, running them causes confusion in the Evergreen sidebar because there are multiple red boxes: one for the test and at least one for the hooks.

In the worst case the failing test leaves the server in an inconsistent state, which can cause the hook to hang, making debugging much more difficult.
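A minimal sketch of the proposed behavior, not resmoke.py's actual code: the runner below skips every after-test hook when the test itself fails, while still continuing to the next test. The names `run_suite`, `TestResult`, and the `continue_on_failure` parameter are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TestResult:
    name: str
    passed: bool
    hooks_run: List[str] = field(default_factory=list)


def run_suite(
    tests: List[Tuple[str, Callable[[], None]]],
    after_hooks: List[Tuple[str, Callable[[], None]]],
    continue_on_failure: bool = True,
) -> List[TestResult]:
    """Run each test; run the after-test hooks only if the test passed."""
    results = []
    for name, test in tests:
        try:
            test()
            passed = True
        except Exception:
            passed = False

        result = TestResult(name, passed)
        if passed:
            # Hooks run only after a passing test, so a failed test is
            # reported once and cannot hang a hook on leftover state.
            for hook_name, hook in after_hooks:
                hook()
                result.hooks_run.append(hook_name)
        results.append(result)

        if not passed and not continue_on_failure:
            break
    return results
```

With this shape, a failing test produces a single failure entry instead of one for the test plus one per hook. The trade-off raised in the comments still applies: skipping hooks also skips whatever diagnostics they would have produced.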



 Comments   
Comment by Ian Whalen (Inactive) [ 05/Feb/20 ]

Closing as Gone away, per the last comment.

Comment by Ian Whalen (Inactive) [ 19/Dec/19 ]

With the completion of PM-1547 pending, we don't believe this will make sense anymore. We will likely close this as Won't Fix after that unless someone objects. Please let us know.

Comment by Judah Schvimer [ 09/Nov/18 ]

I agree with max.hirschhorn and would prefer to abort transactions before running any consistency check hooks to prevent hangs.

For example, if a replica set test partitions nodes, fails, and forgets to heal the partitions, will all data consistency checking hooks be compatible with this?

Tests that partition or fail nodes don't tend to run data consistency checks. They generally use their own fixture and mark that they shouldn't check data consistency.

Comment by William Schultz (Inactive) [ 09/Nov/18 ]

max.hirschhorn, I see your point. Generally, I think that it is difficult to assert that every test, upon completion, leaves the database (or cluster) in some kind of "consistent" state that won't interfere with all the consistency checks we may try to execute. For example, if a replica set test partitions nodes, fails, and forgets to heal the partitions, will all data consistency checking hooks be compatible with this? I'm not sure. For this specific case (transactions being left open at the end of tests), I agree that a separate hook for cleaning this up would be sensible. It should run before any other consistency checking hooks run, and would ideally kill any idle transactions and also report information about which transactions are being killed. I think that judah.schvimer mentioned that we may already need to build something similar inside the server, so that could be a starting point.
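A rough sketch of the hook ordering described above, under stated assumptions: `FakeServer`, `KillIdleTransactions`, `ConsistencyCheck`, and `run_after_test_hooks` are all hypothetical names, not resmoke.py or server APIs, and the in-memory dictionary stands in for real transaction state. The real hang is modeled here as an exception so the example terminates.

```python
from typing import Dict, List, Tuple


class FakeServer:
    """Stand-in for a server; maps a transaction id to its state."""

    def __init__(self) -> None:
        self.open_transactions: Dict[str, str] = {}


class KillIdleTransactions:
    """Cleanup hook: kills idle transactions and records what it killed."""

    def __init__(self) -> None:
        self.report: List[Tuple[str, str]] = []

    def after_test(self, server: FakeServer, test_name: str) -> None:
        idle = [txn for txn, state in server.open_transactions.items()
                if state == "idle"]
        for txn in idle:
            del server.open_transactions[txn]
            self.report.append((test_name, txn))


class ConsistencyCheck:
    """Consistency hook that cannot proceed while transactions are open."""

    def __init__(self) -> None:
        self.checked: List[str] = []

    def after_test(self, server: FakeServer, test_name: str) -> None:
        if server.open_transactions:
            # A real check might block here; raising keeps the sketch finite.
            raise RuntimeError("check would block on open transactions")
        self.checked.append(test_name)


def run_after_test_hooks(hooks, server: FakeServer, test_name: str) -> None:
    """Run hooks in list order, so cleanup hooks must be listed first."""
    for hook in hooks:
        hook.after_test(server, test_name)
```

Listing the cleanup hook ahead of the consistency check is the whole design: the check then sees a server with no idle transactions, and the cleanup hook's `report` records which transactions each test left behind.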

Comment by Max Hirschhorn [ 09/Nov/18 ]

Don't run after test hooks in resmoke if the test fails

I'm not confident about this proposal. There shouldn't be anything a test does that can cause data to be corrupt, and so if a test fails and also happens to corrupt data, that's interesting. Also, the decision about whether to archive data files is typically based on whether a resmoke.py hook fails, so we may end up with fewer diagnostics when debugging the test failure.

Ok, cool. I had a test that hung because it tried to run the repl oplog check after a transactions test that threw an exception and left a transaction open. We can probably have the test do a better job of cleaning up after itself, but it felt like if a test fails, there should be few guarantees about the state it leaves things in, and so shutting down the node without doing the checks seems reasonable. Is there a ticket for it?

william.schultz, I would propose that you create a new hook and add it to the test suites you're interested in if there are properties you want to assert about the server's state after a test runs. SERVER-20773 is something that could probably fall under this category.
