[SERVER-46029] do not write core files in the hang analyzer when running locally (sans Evergreen) Created: 07/Feb/20  Updated: 29/Oct/23  Resolved: 21/Feb/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.3.4

Type: Task
Priority: Major - P3
Reporter: Eric Milkie
Assignee: David Percy
Resolution: Fixed
Votes: 0
Labels: tig-hanganalyzer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Related
is related to SERVER-46160 Create a separate task to generate an... Backlog
is related to SERVER-46159 Add an option to the hang analyzer to... Closed
Backwards Compatibility: Fully Compatible
Sprint: STM 2020-02-24, STM 2020-03-09
Participants:
Linked BF Score: 0
Story Points: 0

 Description   

Currently, the hang analyzer can run for local testing if an assert.soon times out. This can write large core files into the current directory, silently, which can consume a lot of disk space. I think we should disable the writing of core files unless running under Evergreen.



 Comments   
Comment by Githook User [ 21/Feb/20 ]

Author: Mikhail Shchatko <mikhail.shchatko@mongodb.com>

Message: SERVER-46029 Do not write core files in the hang analyzer when running locally (sans Evergreen)
Branch: master
https://github.com/mongodb/mongo/commit/9fb1edd400526809c917e99ac4cfb6c9473baf72

Comment by Mikhail Shchatko [ 14/Feb/20 ]

Points 3 and 4 go to new tickets: SERVER-46159 and SERVER-46160.

Comment by Robert Guo (Inactive) [ 11/Feb/20 ]

Per offline discussions with Max and Drew, we're going to have the hang analyzer download the debug symbols. To minimize the chance of the symbols not being ready (once we move them to a separate task), we will increase the timeout for assert.soon from 5 minutes to 10 for Evergreen runs, and lower the timeout for local runs from 5 minutes to 1.

Here's a breakdown of the work involved (cc mikhail.shchatko):
1. Add a boolean TestData.inEVG field (for now) that prevents the hang analyzer from running when it is set to false. The TestData needs to be passed in from resmoke.py here
2. Use the same inEVG flag to decide the default timeout for assert.soon. Set it to 1min for local runs and 10min for Evergreen runs
3. Add an option to the hang analyzer to download the debug symbols. Details TBD since we need to replace an s3.get in evergreen.yml.
4. Create a separate task to generate and upload the debug symbols
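
Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `TestData` object here is a local stand-in for the shell global that resmoke.py would populate, and the helper names `defaultSoonTimeoutMs` and `shouldRunHangAnalyzer` are hypothetical, not names from the actual change.

```javascript
// Stand-in for the shell's TestData global (assumption: resmoke.py would
// set inEVG to true only for Evergreen runs).
const TestData = {inEVG: false};

const ONE_MINUTE_MS = 60 * 1000;

// Step 2 (hypothetical helper): default assert.soon timeout,
// 10 minutes under Evergreen and 1 minute for local runs.
function defaultSoonTimeoutMs(testData) {
    return testData && testData.inEVG ? 10 * ONE_MINUTE_MS : 1 * ONE_MINUTE_MS;
}

// Step 1 (hypothetical helper): run the hang analyzer, and therefore
// write core files, only when the flag says we are under Evergreen.
function shouldRunHangAnalyzer(testData) {
    return Boolean(testData && testData.inEVG);
}

const localTimeoutMs = defaultSoonTimeoutMs(TestData);
const evergreenTimeoutMs = defaultSoonTimeoutMs({inEVG: true});
```

Treating a missing `TestData` the same as a local run keeps the default safe: no core files are written unless the flag is explicitly set.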

Comment by Spencer Jackson [ 10/Feb/20 ]

ryan.timmons, in my case I'm expecting an assert.soon invoked by someone else to fail. My issue is that the parameter isn't plumbed through ReplSetTest.

Comment by Ryan Timmons [ 10/Feb/20 ]

If you're expecting assert.soon to fail, there's an additional parameter you can pass in that will prevent the hang-analyzer from running.

Comment by Spencer Jackson [ 07/Feb/20 ]

robert.guo, that's awesome, I wasn't aware of that. Thank you! We have a lot of SSL/TLS tests which configure replica sets in illegal configurations, and validate that they don't come online. This causes them to fail assert.soon inside the ReplSetTest logic.

Comment by Robert Guo (Inactive) [ 07/Feb/20 ]

Re the first issue, we have a mechanism to globally disable the hang analyzer for exactly this purpose, for places where the test expects assert.soon to fail: MongoRunner.runHangAnalyzer.disable(). If that function is not sufficient, please let me know.
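
The effect of such a global switch can be sketched as below. Everything here is a simplified stand-in so it runs outside the mongo shell: `hangAnalyzerSwitch` and `soonSketch` are hypothetical names, the polling loop is synchronous, and the `enable` counterpart is an assumption. In a real jstest the call would be MongoRunner.runHangAnalyzer.disable() as mentioned above.

```javascript
// Hypothetical stand-in for a global hang analyzer switch; this is not
// the shell's actual implementation.
const hangAnalyzerSwitch = (function() {
    let enabled = true;
    return {
        enable: () => { enabled = true; },
        disable: () => { enabled = false; },
        isEnabled: () => enabled,
    };
})();

// Simplified, synchronous stand-in for assert.soon: polls `func` up to
// `attempts` times; on failure it calls `onHang` only if the switch is on,
// then throws.
function soonSketch(func, attempts, onHang) {
    for (let i = 0; i < attempts; i++) {
        if (func()) {
            return;
        }
    }
    if (hangAnalyzerSwitch.isEnabled()) {
        onHang();
    }
    throw new Error("soonSketch failed after " + attempts + " attempts");
}

// A test that expects the condition to never become true, for example a
// replica set in an illegal TLS configuration that must not come online.
let analyzerRuns = 0;
let sawExpectedFailure = false;
hangAnalyzerSwitch.disable();
try {
    soonSketch(() => false, 3, () => { analyzerRuns++; });
} catch (e) {
    sawExpectedFailure = true;  // the timeout is the expected outcome
} finally {
    hangAnalyzerSwitch.enable();  // restore the default for later checks
}
```

Because the switch is global, it also covers polling loops the test does not call directly, such as those inside a helper, which is the case a per-call parameter cannot reach.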

Re the value that running the hang analyzer provides, I don't believe "the number of BFs" is a good indicator. At least a dozen engineers have asked for this feature, not to mention bugs we may have missed due to undiagnosed BFs. assert.soon() failures are especially tricky to debug because (1) timeouts are more random and harder to repro, (2) hangs are (often) more serious in a distributed system than crashes, and (3) there's very little debugging info available from the logs for hangs/assert.soon() failures.

That said, I do also want debugging info to be collected asynchronously. One way could be to only collect core dumps in the task and run the hang analyzer against the core dump. But we'd need an offline post-processing feature in Evergreen to automate this process so hang analysis is still done.

Comment by Andrew Morrow (Inactive) [ 07/Feb/20 ]

I'm wondering if we should re-evaluate the integration of the hang analyzer with assert.soon. I'm aware of at least two other technical issues with it:

  • It can fire sometimes when an assert.soon doesn't indicate an actual hang. I believe spencer.jackson encountered such an issue.
  • It requires that we download all of the debug symbols at the start of every task, which slows down all task startup by something like a minute.

So, this makes the third issue. And, it gets in the way of my eventual hope of deferring debug information collation outside of the compile task.

Do we have some metrics on how many BF investigations were made simple to resolve by way of having the hang analyzer run on a failed assert.soon? Are we sure the benefit we are getting is worth the complexity, bandwidth, and time?

Generated at Thu Feb 08 05:10:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.