[SERVER-47755] Send SIGABRT as a fallback in the hang analyzer Created: 24/Apr/20  Updated: 29/Oct/23  Resolved: 12/Oct/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Raiden Worley (Inactive) Assignee: Raiden Worley (Inactive)
Resolution: Fixed Votes: 0
Labels: quick-win, tig-hanganalyzer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Sprint: STM 2020-10-05, STM 2020-10-19
Participants:
Linked BF Score: 14
Story Points: 2

 Description   

Due to TIG-859, the hang analyzer may fail to attach and create core dumps in macOS tests. Core dumps are often the only way to get information about process state, and lack of them may completely block Server engineers on BFs, such as the recent BF-16858.

As a fallback measure, we should send SIGABRT to processes that the debugger has failed to create core dumps from. We have precedent from SERVER-45342 of sending externally-created SIGABRTs and logging messages to distinguish them from internally-generated aborts. Since the signal handler is a separate thread that doesn't take locks, it should work in the case of a hang.
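The fallback described above could look roughly like the following sketch. It is not the hang analyzer's actual code: `dump_or_abort`, `try_debugger_dump`, and `send_signal` are hypothetical names chosen for illustration, and the debugger step is passed in as a callable that is assumed to return True when a core dump was written.

```python
import logging
import os
import signal
from typing import Callable, Iterable, List

logger = logging.getLogger("hang_analyzer_sketch")


def dump_or_abort(
    pids: Iterable[int],
    try_debugger_dump: Callable[[int], bool],
    send_signal: Callable[[int, int], None] = os.kill,
) -> List[int]:
    """Try a debugger core dump per pid; send SIGABRT where that fails.

    Returns the pids that were sent SIGABRT. All names here are
    hypothetical and only illustrate the fallback idea.
    """
    aborted = []
    for pid in pids:
        try:
            dumped = try_debugger_dump(pid)
        except Exception:
            # Treat a failure to attach the same as a failed dump.
            logger.exception("Debugger failed on pid %d", pid)
            dumped = False
        if dumped:
            continue
        # Log before signalling so the externally sent abort can be told
        # apart from an abort the process raised on its own.
        logger.warning(
            "No core dump from debugger for pid %d; sending SIGABRT", pid
        )
        try:
            send_signal(pid, signal.SIGABRT)
            aborted.append(pid)
        except ProcessLookupError:
            logger.info("pid %d exited before SIGABRT could be sent", pid)
    return aborted
```

Passing the signal sender in as a parameter keeps the sketch testable without killing real processes. A real caller would leave the default of `os.kill`.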



 Comments   
Comment by Githook User [ 12/Oct/20 ]

Author:

{'name': 'Carl Raiden Worley', 'email': 'carl.worley@10gen.com', 'username': 'aggrand'}

Message: SERVER-47755 Send SIGABRT as a fallback in the hang analyzer
Branch: master
https://github.com/mongodb/mongo/commit/0dadef8dd93175bf3a75412d8a32b377d9eba42c

Comment by Githook User [ 12/Oct/20 ]

Author:

{'name': 'Carl Raiden Worley', 'email': 'carl.worley@10gen.com', 'username': 'aggrand'}

Message: SERVER-47755 Send SIGABRT as a fallback in the hang analyzer
Branch: master
https://github.com/mongodb/mongo/commit/d6f12000b477a80f444e59d304857c31853b036e

Comment by Raiden Worley (Inactive) [ 08/Oct/20 ]

In case anyone wants to run the hang analyzer locally after these changes: I was able to get consistent core dumps from SIGABRTs locally by running sudo sysctl kern.corefile="dump_%N.%P.core" and ulimit -c unlimited. This worked for me on macOS version 10.14.6, but apparently the sysctl API for this has changed a few times across versions.
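For anyone scripting the local setup above, here is a minimal Python sketch. The helper names `corefile_sysctl_command` and `raise_core_limit` are hypothetical. The sketch assumes the `kern.corefile` sysctl name from the comment above is valid on the macOS version in use, which the comment notes may not hold across versions. The core-size limit is raised with the standard `resource` module, which affects only the current process and the children it starts afterwards.

```python
import resource
from typing import List, Tuple


def corefile_sysctl_command(pattern: str = "dump_%N.%P.core") -> List[str]:
    """Build (but do not run) the sysctl command quoted in the comment.

    Assumes `kern.corefile` is the right sysctl name on this macOS version.
    """
    return ["sudo", "sysctl", f"kern.corefile={pattern}"]


def raise_core_limit() -> Tuple[int, int]:
    """Raise the soft core-size limit to the hard limit.

    This is similar to `ulimit -c unlimited` when the hard limit is
    unlimited. It applies to this process and to children started later.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

The command list can be passed to `subprocess.run(..., check=True)` on a macOS host. It is only built here, not executed, so the sketch has no side effects that need root.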

SERVER-37462 and BUILD-4025 made those changes in CI, but it looks like some macOS hosts still aren't having the core pattern set, so I filed BUILD-12127 to address that.

We'll go ahead with the changes in this ticket, which will immediately improve the situation: core dumps will be created more often than not. We can expect the rest of the core dumps to start being successfully generated after BUILD-12127 is resolved.

Comment by Cristopher Stauffer [ 02/Oct/20 ]

Richard,

I wanted to reopen this to make sure we answer Samy's question and see if there are any other possible paths before we close this out. If in fact we have no other options, we should discuss what test coverage for macOS looks like in the future.

Comment by Samyukta Lanka [ 29/Sep/20 ]

richard.samuels, Repl has been seeing a lot of failures on macOS without core dumps recently. Do you know if there's another ticket we could track with an alternative approach?

Comment by Richard Samuels (Inactive) [ 29/Sep/20 ]

SIGABRT does not reliably produce core dumps on macOS. It might even be less reliable than what we currently do with lldb.

Comment by Brooke Miller [ 28/May/20 ]

Need to wait until Archive Data Files Project (PM-1569) is done, or until the hang_analyzer behaves differently for macOS.

Comment by Brooke Miller [ 12/May/20 ]

Discussed that we will add an option to make it a no-op when it's running on macOS in Evergreen, and add a code path to resmoke's signal handler to send SIGABRT to all processes when archival is not configured on macOS in the case of a hang.

Comment by Raiden Worley (Inactive) [ 01/May/20 ]

Re above comment: SERVER-47880

Comment by Raiden Worley (Inactive) [ 24/Apr/20 ]

Worth thinking about: if we manage to send SIGSTOP to resolve TIG-768 as discussed in the comments of SERVER-46693, I'm not sure how that might affect the signal handler thread's response to a SIGABRT.

Generated at Thu Feb 08 05:15:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.