[SERVER-49078] Suppress ThreadSanitizer: signal-unsafe call inside of a signal Created: 24/Jun/20 Updated: 29/Oct/23 Resolved: 15/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mark Benvenuto | Assignee: | Andrew Morrow (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | thread-sanitizer | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Sprint: | Dev Platform 2020-07-13, Dev Platform 2020-07-27 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||
| Description |
|
The printStackTrace function allocates memory and is called as part of signal handlers. ThreadSanitizer warns about any calls to malloc/free and new/delete as part of signal handling. These memory allocation calls are expected and the warning should be suppressed in etc/tsan.suppressions. Here is my proposed list from ad-hoc testing. The allocations are from many places: boost::log, RamLog, the backtrace code, BSONObj, json escaping, etc.
|
| Comments |
| Comment by Githook User [ 15/Jul/20 ] | ||||||||||||||||||||
|
Author: {'name': 'Andrew Morrow', 'email': 'acm@mongodb.com', 'username': 'acmorrow'}Message: | ||||||||||||||||||||
| Comment by Billy Donahue [ 14/Jul/20 ] | ||||||||||||||||||||
|
Sounds like the right way to go. | ||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 13/Jul/20 ] | ||||||||||||||||||||
|
Confirmed, this is a TSAN bug: https://github.com/google/sanitizers/issues/943 After applying the following diff:
I built like:
I was able to repro the crash observed in ryan.egesdahl's patch, and the situation aligns perfectly: libunwind + TSAN, and the stack is a good match to the bug report as well:
Perhaps we should disable libunwind with TSAN? billy.donahue what do you think? | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 09/Jul/20 ] | ||||||||||||||||||||
|
acm I added a suppression specifically for the test, but it causes a segfault for the reasons previously noted in this ticket. Do you think we should block this issue on fixing the segfault? I don't really have a way of verifying that all of the other tests will pass otherwise. | ||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 09/Jul/20 ] | ||||||||||||||||||||
|
ryan.egesdahl - I took a quick look at the test and I think it is pretty straightforward: your suppressions only hit when you are in the real signal handling paths like signal::abrubtQuitAction or signal::abruptQuitWithAddrSignal, but this test just explicitly sets a signal handler that does not flow through that code and then raises the signal explicitly. I think we would be fine to either add an explicit suppression for this test, or use an annotation from tsan_interface.h to do the same but in the code, or use __has_feature to disable this one specific unit test. | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 09/Jul/20 ] | ||||||||||||||||||||
|
We are still seeing some failures in this despite the suppressions I added. I think they are going through a code path I did not suppress. I am unsure as to what I am able to do further here. | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 07/Jul/20 ] | ||||||||||||||||||||
|
We're restricting this ticket to suppressing the known AS-unsafe calls. Fixing all of the failures in this ticket is proving to be more complex than we had originally thought. We will continue work in | ||||||||||||||||||||
| Comment by Githook User [ 07/Jul/20 ] | ||||||||||||||||||||
|
Author: {'name': 'Ryan Egesdahl', 'email': 'ryan.egesdahl@mongodb.com', 'username': 'deriamis'}Message: Two functions we call in signal handlers call malloc() down the line. | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ] | ||||||||||||||||||||
|
In that case, I can probably add a suppression for it the same way we do for death tests. That will let me keep the suppression as narrow as possible. I'll add that, re-test, and update as appropriate. | ||||||||||||||||||||
| Comment by Billy Donahue [ 30/Jun/20 ] | ||||||||||||||||||||
|
stacktrace_test is a one-off doing unconventional C/Unix level things to torture the stack tracing and sigaltstack. | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ] | ||||||||||||||||||||
|
Suppressing those functions work for thread_safety_context_test, but not stacktrace_test. It looks like that one might need more work. | ||||||||||||||||||||
| Comment by Billy Donahue [ 30/Jun/20 ] | ||||||||||||||||||||
|
The signal handlers we're interested in are set up by setupSynchronousSignalHandlers in src/mongo/util/signal_handlers_synchronous.cpp. This configures: {SIGQUIT,SIGABRT} to invoke `abruptQuitAction`. {SIGSEGV,SIGBUS,SIGILL,SIGFPE} to invoke `abruptQuitWithAddrSignal`. So that's already just two functions to suppress. We can try to just start there. | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ] | ||||||||||||||||||||
|
billy.donahue That seems reasonable, if possibly a bit painful (sorry). I'm happy to help with the suppression and/or blacklist syntax when you get there, if necessary. It can sometimes be a bit obtuse. | ||||||||||||||||||||
| Comment by Billy Donahue [ 30/Jun/20 ] | ||||||||||||||||||||
|
I think you're right. There's not much in those annotations at all. A suppression applies to the whole call stack, so we should able to get away with only one suppression. We'll just make sure all the signal handling goes through one suppressed function. | ||||||||||||||||||||
| Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ] | ||||||||||||||||||||
|
billy.donahue I've done some looking at our options for annotating this code, and I don't think we have an annotation available that fits our needs here. It doesn't look like we can tell TSAN "no, really, this is a safe use of malloc() in a signal handler" at this time: https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/rtl/tsan_interface_ann.cpp If we're going to allow this, I think the only options we have are to either use a blacklist or a suppression. Neither one of those is going to be conditional on when we're doing it safely unless we can restrict it to a particular function call that only happens when we are truly safe somehow. I leave the decision as to whether we suppress or blacklist up to the appropriate persons who know the code better. I certainly don't see at the moment how we can always ensure we are doing this safely if we suppress or blacklist. Or, if you see a relevant annotation that I've missed, please feel free to point it out to me. | ||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 27/Jun/20 ] | ||||||||||||||||||||
|
billy.donahue - We do already make use of <sanitizier/lsan_interface.h> so there is definitely prior art for using such annotations in the codebase. However, I'd like our default position on using annotations to paper over things to be that it is a last resort after investigating the alternatives. However, if we have already made a deliberate design decision to not attempt to limit our signal handlers to only async signal safe code, then explaining that decision to TSAN is the right thing to do. The SDP team will investigate what our options are here. | ||||||||||||||||||||
| Comment by Billy Donahue [ 27/Jun/20 ] | ||||||||||||||||||||
|
We should be relying on in-code annotations more than adding an external point of maintenance outside of C++ if we can help it. I don't have much experience with this but there are <sanitizer/tsan_interface.h> <sanitizer/lsan_interface.h>, etc full of goodies we might switch to instead. Ideally we would be able to put a wide RAII guard to disable this warning around the signal handlers in a single place using the right runtime hints. | ||||||||||||||||||||
| Comment by Billy Donahue [ 27/Jun/20 ] | ||||||||||||||||||||
|
I don't think I'm smart enough to rule out the AS (asynchronous signal) -unsafe calls (malloc/new/etc...) we make in our signal handlers as a cause of unexplained stacktrace failures. We just don't know what those failures are yet. Eric traced libunwind as it tried to open files and memory segments, etc, but I don't think there was a root cause identified there, and it could be that those internal operations were affected by the AS context. | ||||||||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 25/Jun/20 ] | ||||||||||||||||||||
|
Thanks. On the other ticket milkie said he had reproduced that issue and also feels it is not caused by this. | ||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 25/Jun/20 ] | ||||||||||||||||||||
|
bruce.lucas - My intuition says they are separate issues, but, intuition is frequently wrong. | ||||||||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 25/Jun/20 ] | ||||||||||||||||||||
|
Any chance this is causing | ||||||||||||||||||||
| Comment by Andrew Morrow (Inactive) [ 25/Jun/20 ] | ||||||||||||||||||||
|
bruce.lucas - Because malloc and new are not async signal safe. Ideally we would not be allocating in the signal handler. There were some discussions about ensuring that we did not during the work on backtrace/unwind/SIGUSR stuff, but I think the decision in the end was that the difficulty was too high given that in most cases we are rapidly headed for terminate when signals happen. billy.donahue should have more context. | ||||||||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 25/Jun/20 ] | ||||||||||||||||||||
|
Just curious, why does it think calls to the allocator are unsafe inside signal handling code? I wonder if there's a good reason for it that we need to take into account in our signal handlers. |