[SERVER-46029] do not write core files in the hang analyzer when running locally (sans Evergreen) Created: 07/Feb/20 Updated: 29/Oct/23 Resolved: 21/Feb/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.4 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Eric Milkie | Assignee: | David Percy |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | tig-hanganalyzer |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | STM 2020-02-24, STM 2020-03-09 |
| Participants: |
| Linked BF Score: | 0 |
| Story Points: | 0 |
| Description |
|
Currently, the hang analyzer can run for local testing if an assert.soon times out. This can write large core files into the current directory, silently, which can consume a lot of disk space. I think we should disable the writing of core files unless running under Evergreen. |
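A minimal sketch of the proposed gating, assuming the caller already knows whether it is running under Evergreen. The helper name `buildHangAnalyzerArgs`, the `inEvergreen` flag, the script name, and both command-line flags are hypothetical and only illustrate the idea; none of them are taken from the actual hang analyzer.

```javascript
// Hypothetical helper: build the argument list for a hang analyzer invocation.
// The script name and flag names below are illustrative assumptions, not the real CLI.
function buildHangAnalyzerArgs({pids, inEvergreen}) {
    const args = ["hang_analyzer.py", "--process-ids", pids.join(",")];

    // Request core dumps only when running under Evergreen, so that local runs
    // do not silently fill the current directory with large core files.
    if (inEvergreen) {
        args.push("--dump-core");
    }
    return args;
}

// Local run: stack traces only, no core files requested.
console.log(buildHangAnalyzerArgs({pids: [101, 102], inEvergreen: false}).join(" "));
// Evergreen run: core dumps requested as well.
console.log(buildHangAnalyzerArgs({pids: [101, 102], inEvergreen: true}).join(" "));
```

Keeping the decision in one place where the arguments are assembled means a local run would still get the hang analyzer's stack traces, just without the disk cost of the core files.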
| Comments |
| Comment by Githook User [ 21/Feb/20 ] |
|
Author: {'name': 'Mikhail Shchatko', 'email': 'mikhail.shchatko@mongodb.com'} Message: |
| Comment by Mikhail Shchatko [ 14/Feb/20 ] |
|
Points 3 and 4 go to new tickets: |
| Comment by Robert Guo (Inactive) [ 11/Feb/20 ] |
|
Per offline discussions with Max and Drew, we're going to have the hang analyzer download the debug symbols. To minimize the chance of the symbols not being ready (once we move them to a separate task), we will increase the timeout for assert.soon from 5 minutes to 10 for Evergreen runs, and lower the timeout for local runs from 5 minutes to 1. Here's a breakdown of the work involved (cc mikhail.shchatko): |
| Comment by Spencer Jackson [ 10/Feb/20 ] |
|
ryan.timmons, in my case I'm expecting an assert.soon invoked by someone else to fail. My issue is that the parameter isn't plumbed through ReplSetTest. |
| Comment by Ryan Timmons [ 10/Feb/20 ] |
|
If you're expecting assert.soon to fail, there's an additional parameter you can pass in that will prevent the hang-analyzer from running. |
| Comment by Spencer Jackson [ 07/Feb/20 ] |
|
robert.guo, that's awesome, I wasn't aware of that. Thank you! We have a lot of SSL/TLS tests which configure replica sets in illegal configurations and validate that they don't come online. This causes them to fail assert.soon inside the ReplSetTest logic. |
| Comment by Robert Guo (Inactive) [ 07/Feb/20 ] |
|
Re the first issue, we have a mechanism to globally disable the hang analyzer, MongoRunner.runHangAnalyzer.disable(), for exactly this purpose: places where the test expects assert.soon to fail. If that function is not sufficient, please let me know. Re the issue of how much value running the hang analyzer provides, I don't believe "the number of BFs" is a good indicator. There have been at least a dozen engineers who have asked for this feature, not to mention bugs we may have missed due to undiagnosed BFs. assert.soon() failures are especially tricky to debug because 1. timeouts are more random and harder to repro, 2. hangs are (often) more serious in a distributed system than crashes, and 3. there's very little debugging info available from the logs for hangs/assert.soon() failures. That said, I do also want debugging info to be collected asynchronously. One way could be to only collect core dumps in the task and run the hang analyzer against the core dump. But we'd need an offline post-processing feature in Evergreen to automate this process so hang analysis is still done. |
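To make the two opt-outs discussed in this thread concrete, here is a rough sketch of how a per-call option and a global switch could interact. Everything in it is a simplified stand-in: `soon`, `hangAnalyzerHook`, the attempt-count polling, and the `runHangAnalyzer` option name are assumptions for illustration, not the shell's actual assert.soon or MongoRunner code.

```javascript
// Illustrative stand-in for a hang analyzer hook with a global on/off switch.
const hangAnalyzerHook = (function() {
    let enabled = true;
    let runs = 0;

    // Returns true if the (pretend) analyzer ran, false if it was globally disabled.
    function run() {
        if (!enabled) {
            return false;
        }
        runs++;
        return true;
    }
    run.disable = function() { enabled = false; };
    run.enable = function() { enabled = true; };
    run.runCount = function() { return runs; };
    return run;
})();

// Simplified stand-in for assert.soon: polls `func` up to `maxAttempts` times
// instead of waiting on a wall-clock timeout.
function soon(func, msg, maxAttempts, {runHangAnalyzer = true} = {}) {
    for (let i = 0; i < maxAttempts; i++) {
        if (func()) {
            return;
        }
    }
    // The per-call option is checked first; the global switch is checked inside the hook.
    if (runHangAnalyzer) {
        hangAnalyzerHook();
    }
    throw new Error("soon failed: " + msg);
}

// A test that expects the condition to fail but cannot pass the per-call option
// (for example, because the call is buried inside a helper) flips the global switch.
hangAnalyzerHook.disable();
try {
    soon(() => false, "replica set should not come online", 3);
} catch (e) {
    console.log("expected failure: " + e.message);
}
hangAnalyzerHook.enable();
```

The global switch covers the case raised above where the failing call is made by shared helper code and the per-call option is not plumbed through to the caller.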
| Comment by Andrew Morrow (Inactive) [ 07/Feb/20 ] |
|
I'm wondering if we should re-evaluate the integration of the hang analyzer with assert.soon. I'm aware of at least two other technical issues with it:
So, this makes the third issue. And, it gets in the way of my eventual hope of deferring debug information collation outside of the compile task. Do we have some metrics on how many BF investigations were made simple to resolve by way of having the hang analyzer run on a failed assert.soon? Are we sure the benefit we are getting is worth the complexity, bandwidth, and time? |