[SERVER-26867] Timeout-related assert.soon failures should trigger the hang analyzer Created: 01/Nov/16 Updated: 19/Nov/19 Resolved: 31/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mathias Stearn | Assignee: | Ryan Timmons |
| Resolution: | Done | Votes: | 11 |
| Labels: | stm, tig-hanganalyzer |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | STM 2019-10-31 |
| Participants: | |
| Linked BF Score: | 22 |
| Story Points: | 2 |
| Description |
|
When a test fails due to a timeout, it would be nice if that triggered the hang analyzer before tearing down the processes. This would let us see what the server is doing that is preventing it from doing what we expect.
Update: Per the discussion below, we will be limiting this to assert.soon, assert.soonNoExcept, assert.retry, and assert.retryNoExcept. |
| Comments |
| Comment by Ryan Timmons [ 31/Oct/19 ] |
|
This has been released. assert.soon, assert.retry, and assert.time will now call the hang-analyzer prior to throwing. |
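A minimal sketch of the kind of hook described here, assuming the shell's MongoRunner.runHangAnalyzer() entry point discussed later in this ticket; the wrapper shape, helper name, and timeout default are illustrative, not the actual implementation:

```javascript
// Illustrative only: a retry helper that runs the hang analyzer before failing.
// assert.soon/assert.retry/assert.time follow the same general idea.
function soonOrAnalyze(conditionFn, msg, timeoutMs = 5 * 60 * 1000, intervalMs = 200) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        if (conditionFn()) {
            return;  // condition became true in time
        }
        sleep(intervalMs);  // shell built-in; wait before re-checking
    }
    try {
        // Collect stack traces/cores from processes the shell and resmoke know about.
        MongoRunner.runHangAnalyzer();
    } catch (e) {
        print("hang analyzer invocation failed: " + e);
    }
    doassert("timed out waiting for condition: " + msg);
}
```

A test would call such a helper the same way it calls assert.soon today; the only behavioral difference is the analyzer run before the assertion is thrown.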
| Comment by Githook User [ 31/Oct/19 ] |
|
Author: Ryan Timmons (rtimmons) <ryan.timmons@mongodb.com> |
| Comment by Judah Schvimer [ 16/Oct/19 ] |
|
SGTM. For wtimeout and maxTimeMS, we've largely decided to stop providing timeouts, so that tests hang until the task timeout kills the processes and the hang analyzer output is produced. This may not be ideal, but it works. |
| Comment by Ryan Timmons [ 16/Oct/19 ] |
|
You can now call MongoRunner.runHangAnalyzer() in the shell to invoke hang_analyzer.py on all processes started by the shell and resmoke; that work was done in a separate change. It is trivial to call this from assert.soon prior to actually throwing. But the original text of this ticket also calls for invoking it when wtimeout and $maxTimeMS expire. I'm less confident about how we'd hook into those mechanisms within the shell without doing some trickery to monkeypatch calls to runCommand and inspect args for wtimeout/$maxTimeMS (the same argument applies to assert.commandWorked). It seems like we get most of the benefit from hooking into the assert helpers themselves.
Unless there's a simple way to incorporate this function call into other timeout scenarios, I propose that we don't automatically run hang_analyzer.py for anything other than these 4 assert methods. This aligns with the original scope doc for hang_analyzer.py integration (PM-1546). Open questions for anyone watching this ticket: is there a pressing need to cover the other timeout scenarios, and is there an easy way to do so?
If there are no objections in the coming day or two, I'll update this ticket's description to reflect changes to just assert.soon and friends. We can evaluate broader integration in future tickets. Please feel free to reach out to me/STM on Slack if that's easier. |
| Comment by Max Hirschhorn [ 07/Aug/19 ] |
We're planning to write a scope document for this work in the next iteration. I don't imagine it'll need a design document, but it should have a sketch of the implementation as an appendix for reviewers to read through. It'd be worth having a conversation about whether we'd want to prioritize this above the query and aggregation fuzzer work currently in progress and the performance testing with slow networks that was slated to be next.
Let's decide on that after the scope document (especially the appendix section) is written. |
| Comment by Judah Schvimer [ 07/Aug/19 ] |
|
max.hirschhorn and robert.guo, what is the timeline for doing #2? Would a repl team member be able to do it effectively? |
| Comment by Max Hirschhorn [ 06/Aug/19 ] |
|
I chatted with william.schultz in person yesterday. I believe he and I are on the same page about the local testing experience and why we'd rather consider different timeouts: effectively "forever" when running in Evergreen versus something shorter when running locally. (I stand by the position that 5 minutes is already too long to wait locally to find out you wrote the wrong condition inside of an assert.soon() while iterating on writing/debugging a test, but it would be challenging to reverse that.) Of the different options we discussed,
I had imagined we'd achieve this ticket by doing #2. |
| Comment by William Schultz (Inactive) [ 05/Aug/19 ] |
|
BF-14272 (investigated by samy.lanka) is another case where getting stack traces after an awaitReplication timeout would have been very helpful. In a recent replication meeting we raised the question of why it was not considered acceptable to simply raise the default awaitReplication timeout to match the task timeout, so that the task timeout would trigger the hang analyzer. We thought there may have been a prior objection on the grounds that it would make the local testing experience worse, but the details of that objection were not entirely clear; it would be good to clarify what the possible objections are. Another proposal that came up was to allow the awaitReplication timeout to be configured separately when running in Evergreen versus running locally, which would avoid any negative side effects on the local testing experience. One option would be to allow the awaitReplication timeout to be overridden by setting a "TestData" parameter; something similar was done previously for altering log verbosity. |
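A hedged sketch of that TestData-override idea; the parameter name awaitReplicationTimeoutMS is hypothetical, not an existing resmoke option:

```javascript
// Hypothetical: resmoke could set TestData.awaitReplicationTimeoutMS for
// Evergreen runs, while local runs fall back to the shorter default.
const kDefaultAwaitReplicationTimeoutMS = 5 * 60 * 1000;

function getAwaitReplicationTimeoutMS() {
    if (typeof TestData !== "undefined" && TestData.awaitReplicationTimeoutMS) {
        return TestData.awaitReplicationTimeoutMS;
    }
    return kDefaultAwaitReplicationTimeoutMS;
}
```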
| Comment by Siyuan Zhou [ 21/Jun/19 ] |
|
BF-13678 doesn't have sufficient evidence to determine what's causing the timeout. A core dump would also be very helpful when awaitReplication() times out. |
| Comment by Dianna Hohensee (Inactive) [ 21/Apr/17 ] |
|
To jump on the bandwagon of providing use-case incentives for this improvement: this would be nice for assert.soon failures like BF-5335, where we're concerned about a deadlock because, inexplicably, nothing is happening and nothing of interest is getting logged (and we've been making significant sharding code additions). I'm going to try to artificially provoke the hang analyzer somehow to get stack dumps, along with bumping the verbosity, in hopes of gleaning more information – this ticket would do it all for me, apparently. |
| Comment by Judah Schvimer [ 14/Nov/16 ] |
|
This would make BF-3913 much easier to debug. It appears to be a hang elsewhere in the system that leads to writeConcern timing out. Knowing what else is happening on the secondary would make it clear what was hanging and causing the timeout. |
| Comment by Mathias Stearn [ 14/Nov/16 ] |
|
Expanding the scope to include other timeouts after seeing a hard-to-debug wtimeout failure. |
| Comment by Eric Milkie [ 01/Nov/16 ] |
|
Or rather, the threads are working normally (or are all idle), and it is the logic of the test that set up the failure to begin with. |
| Comment by Judah Schvimer [ 01/Nov/16 ] |
|
If we do this, assert.retry() should also probably be included. |
| Comment by Eric Milkie [ 01/Nov/16 ] |
|
I don't think this would be that useful. The overwhelming majority of assert.soon failures that I've seen are failing without the server threads actually being hung. |