[SERVER-49078] Suppress ThreadSanitizer: signal-unsafe call inside of a signal Created: 24/Jun/20  Updated: 29/Oct/23  Resolved: 15/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Mark Benvenuto Assignee: Andrew Morrow (Inactive)
Resolution: Fixed Votes: 0
Labels: thread-sanitizer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-49346 Re-enable use of libunwind with TSAN Closed
is related to SERVER-48622 stacktrace_test segfaults with TSAN Open
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Dev Platform 2020-07-13, Dev Platform 2020-07-27
Participants:
Linked BF Score: 0

 Description   

The printStackTrace function allocates memory and is called as part of signal handlers. ThreadSanitizer warns about any calls to malloc/free and new/delete as part of signal handling. These memory allocation calls are expected and the warning should be suppressed in etc/tsan.suppressions.

Here is my proposed list from ad-hoc testing. The allocations are from many places: boost::log, RamLog, the backtrace code, BSONObj, json escaping, etc.

signal:boost::log::v2s_mt_posix::aux::stateless_allocator<unsigned char>::allocate
signal:boost::log::v2s_mt_posix::aux::stateless_allocator<unsigned char>::deallocate
signal:boost::log::v2s_mt_posix::aux::stateless_allocator<char>::allocate
signal:boost::log::v2s_mt_posix::aux::stateless_allocator<char>::deallocate
signal:__gnu_cxx::new_allocator<boost::shared_ptr<boost::log::v2s_mt_posix::sinks::sink> >::allocate
signal:__gnu_cxx::new_allocator<boost::shared_ptr<boost::log::v2s_mt_posix::sinks::sink> >::deallocate
signal:mongo::mongoMalloc
signal:mongo::intrusive_ptr_release
signal:std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign
signal:__gnu_cxx::new_allocator<boost::shared_ptr<boost::log::v2s_mt_posix::sinks::sink> >::deallocate
signal:std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate
signal:mongo::mongoRealloc
signal:__gnu_cxx::new_allocator<unsigned long>::allocate
signal:__gnu_cxx::new_allocator<unsigned long>::deallocate
signal:__gnu_cxx::new_allocator<mongo::BSONElement>::allocate
signal:__gnu_cxx::new_allocator<mongo::BSONElement>::deallocate
signal:__gnu_cxx::new_allocator<char>::allocate
signal:__gnu_cxx::new_allocator<char>::deallocate
signal:mongo::logv2::(anonymous namespace)::JSONValueExtractor::storeUnquoted
signal:mongo::StackTraceAddressMetadata::BaseAndName::~BaseAndName
signal:mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl



 Comments   
Comment by Githook User [ 15/Jul/20 ]

Author:

{'name': 'Andrew Morrow', 'email': 'acm@mongodb.com', 'username': 'acmorrow'}

Message: SERVER-49078 Disable libunwind for TSAN
Branch: master
https://github.com/mongodb/mongo/commit/f07b26628c8aaa9cbd68cbeb096c08c778c8c4a8

Comment by Billy Donahue [ 14/Jul/20 ]

Sounds like the right way to go.

Comment by Andrew Morrow (Inactive) [ 13/Jul/20 ]

Confirmed, this is a TSAN bug: https://github.com/google/sanitizers/issues/943

After applying the following diff:

diff --git a/etc/tsan.suppressions b/etc/tsan.suppressions
index 08da28be65..f02173b579 100644
--- a/etc/tsan.suppressions
+++ b/etc/tsan.suppressions
@@ -23,3 +23,4 @@ race:src/third_party/wiredtiger/*
 # handlers.
 signal:abruptQuitAction
 signal:abruptQuitWithAddrSignal
+signal:StackTraceSigAltStackTest::tryHandler

I built like:

python3 ./buildscripts/scons.py --variables-files= --variables-files=etc/scons/mongodbtoolchain_v3_clang.vars --dbg=on --opt=on --allocator=system --sanitize=thread --install-action=hardlink --implicit-cache --build-fast-and-loose=on --link-model=dynamic ICECC= CCACHE= build/install/bin/stacktrace_test

I was able to repro the crash observed in ryan.egesdahl's patch, and the situation aligns perfectly: libunwind + TSAN, and the stack is a good match to the bug report as well:

Thread 3 (Thread 0x7fffec4fd700 (LWP 27974)):
#0  Callback () at /data/mci/76345995686809862f0b80d49f4a9e31/toolchain-builder/tmp/build-llvm.sh-IHB/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cc:35
#1  MutexPreLock () at /data/mci/76345995686809862f0b80d49f4a9e31/toolchain-builder/tmp/build-llvm.sh-IHB/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cc:149
#2  0x00005555555ed65e in __interceptor_pthread_mutex_lock () at /data/mci/76345995686809862f0b80d49f4a9e31/toolchain-builder/tmp/build-llvm.sh-IHB/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4072
#3  0x00007ffff7fdd88a in _UIx86_64__mempool_free (pool=0x7ffff7fef808 <trace_cache_pool>, object=0x7fffee53dfe0) at src/third_party/unwind/dist/src/mi/mempool.c:179
#4  0x00007ffff7fe226b in trace_cache_free (arg=0x7fffee53dfe0) at src/third_party/unwind/dist/src/x86_64/Gtrace.c:72
#5  0x00007ffff598c408 in __nptl_deallocate_tsd () at pthread_create.c:300
#6  0x00007ffff598d81b in __nptl_deallocate_tsd () at ../sysdeps/nptl/futex-internal.h:200
#7  start_thread (arg=0x7fffec4fd700) at pthread_create.c:473
#8  0x00007ffff549ea3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Perhaps we should disable libunwind with TSAN? billy.donahue what do you think?

Comment by Ryan Egesdahl (Inactive) [ 09/Jul/20 ]

acm I added a suppression specifically for the test, but it causes a segfault for the reasons previously noted in this ticket. Do you think we should block this issue on fixing the segfault? I don't really have a way of verifying that all of the other tests will pass otherwise.

Comment by Andrew Morrow (Inactive) [ 09/Jul/20 ]

ryan.egesdahl - I took a quick look at the test and I think it is pretty straightforward: your suppressions only hit when you are in the real signal handling paths like signal::abrubtQuitAction or signal::abruptQuitWithAddrSignal, but this test just explicitly sets a signal handler that does not flow through that code and then raises the signal explicitly. I think we would be fine to either add an explicit suppression for this test, or use an annotation from tsan_interface.h to do the same but in the code, or use __has_feature to disable this one specific unit test.

Comment by Ryan Egesdahl (Inactive) [ 09/Jul/20 ]

We are still seeing some failures in this despite the suppressions I added. I think they are going through a code path I did not suppress. I am unsure as to what I am able to do further here.

Comment by Ryan Egesdahl (Inactive) [ 07/Jul/20 ]

We're restricting this ticket to suppressing the known AS-unsafe calls. Fixing all of the failures in this ticket is proving to be more complex than we had originally thought. We will continue work in SERVER-49346 to diagnose the remaining test failures.

Comment by Githook User [ 07/Jul/20 ]

Author:

{'name': 'Ryan Egesdahl', 'email': 'ryan.egesdahl@mongodb.com', 'username': 'deriamis'}

Message: SERVER-49078 Suppress TSAN reports for AS-unsafe calls in signal handlers

Two functions we call in signal handlers call malloc() down the line.
Because these functions generally happen while the thread was dying, we
will permit the AS-unsafe call, especially since we've never had any
problems with any allocators we've ever used.
Branch: master
https://github.com/mongodb/mongo/commit/6a8d8eb4da7705e9bbf46890051301a9e8ef3e90

Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ]

In that case, I can probably add a suppression for it the same way we do for death tests. That will let me keep the suppression as narrow as possible. I'll add that, re-test, and update as appropriate.

Comment by Billy Donahue [ 30/Jun/20 ]

stacktrace_test is a one-off doing unconventional C/Unix level things to torture the stack tracing and sigaltstack.
We could exempt it from TSAN entirely as far as I'm concerned.

Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ]

Suppressing those functions work for thread_safety_context_test, but not stacktrace_test. It looks like that one might need more work.

Comment by Billy Donahue [ 30/Jun/20 ]

The signal handlers we're interested in are set up by setupSynchronousSignalHandlers in src/mongo/util/signal_handlers_synchronous.cpp.

This configures:

    {SIGQUIT,SIGABRT} to invoke `abruptQuitAction`.

    {SIGSEGV,SIGBUS,SIGILL,SIGFPE} to invoke `abruptQuitWithAddrSignal`.

So that's already just two functions to suppress. We can try to just start there.

Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ]

billy.donahue That seems reasonable, if possibly a bit painful (sorry). I'm happy to help with the suppression and/or blacklist syntax when you get there, if necessary. It can sometimes be a bit obtuse.

Comment by Billy Donahue [ 30/Jun/20 ]

I think you're right. There's not much in those annotations at all.
Oh well, it was worth a look. Thanks for trying.

A suppression applies to the whole call stack, so we should able to get away with only one suppression. We'll just make sure all the signal handling goes through one suppressed function.

Comment by Ryan Egesdahl (Inactive) [ 30/Jun/20 ]

billy.donahue I've done some looking at our options for annotating this code, and I don't think we have an annotation available that fits our needs here. It doesn't look like we can tell TSAN "no, really, this is a safe use of malloc() in a signal handler" at this time:

https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/rtl/tsan_interface_ann.cpp
https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/rtl/tsan_interface_ann.h

If we're going to allow this, I think the only options we have are to either use a blacklist or a suppression. Neither one of those is going to be conditional on when we're doing it safely unless we can restrict it to a particular function call that only happens when we are truly safe somehow.

I leave the decision as to whether we suppress or blacklist up to the appropriate persons who know the code better. I certainly don't see at the moment how we can always ensure we are doing this safely if we suppress or blacklist. Or, if you see a relevant annotation that I've missed, please feel free to point it out to me.

Comment by Andrew Morrow (Inactive) [ 27/Jun/20 ]

billy.donahue - We do already make use of <sanitizier/lsan_interface.h> so there is definitely prior art for using such annotations in the codebase. However, I'd like our default position on using annotations to paper over things to be that it is a last resort after investigating the alternatives. However, if we have already made a deliberate design decision to not attempt to limit our signal handlers to only async signal safe code, then explaining that decision to TSAN is the right thing to do. The SDP team will investigate what our options are here.

Comment by Billy Donahue [ 27/Jun/20 ]

We should be relying on in-code annotations more than adding an external point of maintenance outside of C++ if we can help it.

I don't have much experience with this but there are <sanitizer/tsan_interface.h> <sanitizer/lsan_interface.h>, etc full of goodies we might switch to instead.

Ideally we would be able to put a wide RAII guard to disable this warning around the signal handlers in a single place using the right runtime hints.

Comment by Billy Donahue [ 27/Jun/20 ]

I don't think I'm smart enough to rule out the AS (asynchronous signal) -unsafe calls (malloc/new/etc...) we make in our signal handlers as a cause of unexplained stacktrace failures. We just don't know what those failures are yet. Eric traced libunwind as it tried to open files and memory segments, etc, but I don't think there was a root cause identified there, and it could be that those internal operations were affected by the AS context.

Comment by Bruce Lucas (Inactive) [ 25/Jun/20 ]

Thanks. On the other ticket milkie said he had reproduced that issue and also feels it is not caused by this.

Comment by Andrew Morrow (Inactive) [ 25/Jun/20 ]

bruce.lucas - My intuition says they are separate issues, but, intuition is frequently wrong.

Comment by Bruce Lucas (Inactive) [ 25/Jun/20 ]

Any chance this is causing SERVER-47775?

Comment by Andrew Morrow (Inactive) [ 25/Jun/20 ]

bruce.lucas - Because malloc and new are not async signal safe. Ideally we would not be allocating in the signal handler. There were some discussions about ensuring that we did not during the work on backtrace/unwind/SIGUSR stuff, but I think the decision in the end was that the difficulty was too high given that in most cases we are rapidly headed for terminate when signals happen. billy.donahue should have more context.

Comment by Bruce Lucas (Inactive) [ 25/Jun/20 ]

Just curious, why does it think calls to the allocator are unsafe inside signal handling code? I wonder if there's a good reason for it that we need to take into account in our signal handlers.

Generated at Thu Feb 08 05:18:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.