[SERVER-52532] Investigate if code during signal handling should use quickExitWithoutLogging Created: 02/Nov/20  Updated: 23/Feb/23  Resolved: 10/Feb/23

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: Billy Donahue
Resolution: Won't Do Votes: 0
Labels: re-triaged-ticket, servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-44570 Create a non process-fatal variant of... Closed
Assigned Teams:
Service Arch
Sprint: Service Arch 2023-02-20
Participants:

 Description   

SERVER-44570 added code to quickExit() [*] to call checkForTripwireAssertions(), which checks the tripwire atomic int and logs if it's non-zero. However, some signal handling code in util/signal_handlers_synchronous.cpp calls quickExit(). Therefore, if any of the code in checkForTripwireAssertions() isn't safe to call in a signal handler, then it should either be made safe, or if this isn't possible, then the signal handling code should instead call quickExitWithoutLogging().

[*]: Technically, it added code to the callers of quickExit(), by renaming the original quickExit() to quickExitWithoutLogging(), and making a new inline quickExit() that calls quickExitWithoutLogging().



 Comments   
Comment by Billy Donahue [ 10/Feb/23 ]

This is probably best handled as part of a project to fix up our signal handlers and their unsafe behaviors in a more thorough way. We're talking about maybe having a sidecar process that just handles log message and/or signal diagnostics.

The idea that our signal handlers are doing complicated work on their own is IMO not really feasible long-term. It's not even safe to get a backtrace, because unwinding can require opening debuginfo sections that are compressed, and so there's allocations involved etc. It's really best if the signal handlers just set a flag and await roadside assistance.

Comment by Billy Donahue [ 10/Feb/23 ]

The main things of concern are:

1/ loading the atomic (which I think is like/just a memory barrier, so maybe safe?)

2/ calling the logv2 functions without the logv2::LogTruncation::Disabled option (based on how writeMallocFreeStreamToLog calls logging)

I did some research to try to answer these concerns.

1/ Yes, loading the atomic with the default .load memory order is safe and consistent. No problem there.

2/ The logging is not really OK according to spec, but we do it all over these signal handlers already, and tassert isn't making that situation worse than it already is. It's just a little more of the same. the MallocFreeStreamToLog has truncation disabled because it's dumping a stack trace and other potentially large data accumulated in the stream, so we're trying to avoid truncation of that precious diagnostic data.

I also don't think the signal handers are calling quickExit. Maybe they used to when this was written? There's one quickExit call which happens when we catch recursive logging but that's unusual. That would occur only if we were inside a log statement and a synchronous signal was generated during the logging.

I think we can keep this one closed. We do know that the signal handlers don't make a very big effort to be sync-signal-safe. I think the allocator and C library are making it possible for use to be sloppy about certain technically unsafe things we're doing in the handlers. The handlers should probably just be setting bits, signalling an outside process to take action, and halting. Let the outside process do the rest.

Comment by Kevin Pulo [ 02/Nov/20 ]

The main things of concern are:

  1. loading the atomic (which I think is like/just a memory barrier, so maybe safe?)
  2. calling the logv2 functions without the logv2::LogTruncation::Disabled option (based on how writeMallocFreeStreamToLog calls logging)
Generated at Thu Feb 08 05:28:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.