[SERVER-82459] Fall back to default signal handler when a thread receives two signals Created: 26/Oct/23  Updated: 16/Nov/23  Resolved: 15/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: George Wangensteen Assignee: Ryan Berryhill
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-83271 Make synchronous signal handlers sign... Open
related to SERVER-66562 Audit, document all functions accesse... Backlog
is related to SERVER-82658 Log system will attempt to allocate i... Open
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service Arch Prioritized List
Participants:

 Description   

In https://github.com/mongodb/mongo/blob/8e4b5670df9b9fe814e57cb5f3f8ee9407237b5a/src/mongo/util/signal_handlers_synchronous.cpp , the server defines signal handlers for a variety of signals that can be synchronously generated, like SIGSEGV and SIGABRT. The signal-handling action for these signals is defined to be some version of logging a fatal error, collecting and logging a backtrace, and then exiting. When a thread receives a second such signal (e.g., it's handling an abort and the signal handler segfaults), the second signal handler calls quickExit, which may call into logging (risky when we're two signal handlers deep) and doesn't call into the default signal handler. This means we won't get a core dump in this case. We should call `endProcessWithSignal` instead.
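
For illustration only, here is a minimal POSIX sketch of the general "restore the default disposition and re-raise" technique. It is not the server's actual `endProcessWithSignal`; `handlerDepth`, `dieBySignal`, `fatalHandler`, and `runDemo` are hypothetical names, and the demo assumes a single-threaded child process (hence `sigprocmask` rather than `pthread_sigmask`).

```cpp
#include <atomic>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a global "how deep are we in fatal handling" counter.
static std::atomic<int> handlerDepth{0};

// Hypothetical helper: restore the default action for signum and re-raise it,
// so the kernel terminates the process (and may dump core) instead of exiting.
// Uses only async-signal-safe calls.
[[noreturn]] void dieBySignal(int signum) {
    struct sigaction sa = {};
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);
    sigaction(signum, &sa, nullptr);

    // The signal is blocked while its own handler runs, so unblock it.
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signum);
    sigprocmask(SIG_UNBLOCK, &set, nullptr);

    raise(signum);
    _exit(127);  // Only reached if the default action did not terminate us.
}

extern "C" void fatalHandler(int signum) {
    if (handlerDepth.fetch_add(1) > 0) {
        // Second fatal signal: no logging, no cleanup, just die by signal.
        dieBySignal(signum);
    }
    // First entry: simulate the handler itself faulting.
    raise(SIGSEGV);
    dieBySignal(signum);
}

// Forks a child that aborts and then "faults" inside its handler.
// Returns the signal that killed the child, 0 if it exited normally,
// or -1 if fork/waitpid failed.
int runDemo() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct sigaction sa = {};
        sa.sa_handler = fatalHandler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGABRT, &sa, nullptr);
        sigaction(SIGSEGV, &sa, nullptr);
        raise(SIGABRT);
        _exit(0);
    }
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0) return -1;
    return WIFSIGNALED(wstatus) ? WTERMSIG(wstatus) : 0;
}
```

In this sketch the parent sees the child terminated by SIGSEGV (the second signal) rather than a normal exit, which is the outcome the ticket asks for.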



 Comments   
Comment by Githook User [ 15/Nov/23 ]

Author: Ryan Berryhill <ryan.berryhill@mongodb.com> (ryanberryhill)

Message: SERVER-82459 Avoid quickExit when a thread receives two fatal signals
Branch: master
https://github.com/mongodb/mongo/commit/b02caa1884fd508fa78d77fca4fa52c8764c8ea5

Comment by Billy Donahue [ 08/Nov/23 ]

New plan after Zoom discussion.

We're already doing a check on a global.

    explicit MallocFreeOStreamGuard() : _lk(_streamMutex, stdx::defer_lock) {
        if (terminateDepth++) {
            quickExit(ExitCode::abrupt);
        }
        _lk.lock();
    }

I think the problem is perhaps that this quickExit call needs to be a more immediate death.

Regarding _lk: By the time this body runs, _lk has been initialized, but with defer_lock, meaning that the constructor doesn't lock the mutex. The destructor unlocks it later only if it was actually locked, and you have to get to the _lk.lock() statement for that. So nothing really happens to the mutex in the _lk initializer. The _lk object is just remembering which mutex to unlock later if necessary.
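
A small standalone sketch of the defer_lock behavior described above, using std::unique_lock and std::defer_lock from the standard library instead of the server's stdx wrappers. `DeferLockStates` and `observeDeferLock` are hypothetical names for this example.

```cpp
#include <mutex>

// Hypothetical record of what the lock object reports at each step.
struct DeferLockStates {
    bool ownedAfterConstruction;
    bool ownedAfterLock;
};

DeferLockStates observeDeferLock() {
    std::mutex m;
    DeferLockStates s{};
    {
        // defer_lock: the lock object remembers the mutex but does not lock it.
        std::unique_lock<std::mutex> lk(m, std::defer_lock);
        s.ownedAfterConstruction = lk.owns_lock();

        // Only an explicit lock() acquires the mutex.
        lk.lock();
        s.ownedAfterLock = lk.owns_lock();
    }  // The destructor unlocks here because the lock is owned.
    return s;
}
```

If the constructor body had left early (as the quickExit path does), lock() would never have run and the mutex would never have been touched.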

So yeah the problem with this whole handler is probably the quickExit trying to do stuff.
For example, quickExit calls warnIfTripwireAssertionsOccurred. I wonder if that would explain why the bug is hard to repro? If and only if there is a tripwire assertion in the lifetime of the process, we'll try to LOGV2 from the signal handler. That warnIfTripwireAssertionsOccurred is potentially the whole problem...
But we don't want our process to "exit", we want to be killed by a signal so the kernel will dump core and return an appropriately alarming wstatus to our launcher.
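
To make the exit-versus-signal distinction concrete, here is a rough POSIX sketch of what a launcher can see in wstatus. `childFate` is a hypothetical helper, and the exit code 3 is an arbitrary placeholder, not the server's actual ExitCode::abrupt value.

```cpp
#include <csignal>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that either exits with a code or dies from SIGABRT,
// then reports how the parent (the "launcher") sees it via waitpid.
std::string childFate(bool killBySignal) {
    pid_t pid = fork();
    if (pid < 0) return "fork failed";
    if (pid == 0) {
        if (killBySignal) {
            signal(SIGABRT, SIG_DFL);  // Make sure the default action applies.
            raise(SIGABRT);            // Killed by the kernel, may dump core.
        }
        _exit(3);  // Plain exit: no core dump, wstatus carries an exit code.
    }
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0) return "waitpid failed";
    if (WIFSIGNALED(wstatus)) return "signal " + std::to_string(WTERMSIG(wstatus));
    if (WIFEXITED(wstatus)) return "exit " + std::to_string(WEXITSTATUS(wstatus));
    return "unknown";
}
```

A child that calls _exit is reported through WIFEXITED, while a child killed by the default SIGABRT action is reported through WIFSIGNALED, which is the "appropriately alarming" case.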

So I'm thinking the quickExit needs to go but we're otherwise ok.

Generated at Thu Feb 08 06:49:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.