  1. Core Server
  2. SERVER-73290

WiredTiger hits limits of TSAN deadlock detector

    • Build

      In stress tests, WiredTiger will occasionally cause TSAN to fail with the following error because a thread is holding more than 64 mutexes at once:

      FATAL: ThreadSanitizer CHECK failed: /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
           #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/tsan/rtl/tsan_rtl_report.cpp:47:25 (mongod+0xe9056)
           #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/sanitizer_common/sanitizer_termination.cpp:78:5 (mongod+0x68c1f)
           #2 __sanitizer::DeadlockDetectorTLS<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >::addLock(unsigned long, unsigned long, unsigned int) /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67:5 (mongod+0x5cc36)
           #3 onLockAfter /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:216:11 (mongod+0x5c188)
           #4 __sanitizer::DD::MutexAfterLock(__sanitizer::DDCallback*, __sanitizer::DDMutex*, bool, bool) /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector1.cpp:169:6 (mongod+0x5c188)
           #5 __tsan::MutexPostLock(__tsan::ThreadState*, unsigned long, unsigned long, unsigned int, int) /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp:199:14 (mongod+0xe73a6)
           #6 pthread_mutex_lock /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4239:5 (mongod+0x9e9f8)
           #7 __wt_spin_lock /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/include/mutex_inline.h:171:16 (libwiredtiger.so+0x107884)
           #8 __split_ref_prepare /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/btree/bt_split.c:366:9 (libwiredtiger.so+0x107884)
           #9 __split_root /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/btree/bt_split.c:509:5 (libwiredtiger.so+0x101a07)
           #10 __split_parent_climb /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/btree/bt_split.c:1330:19 (libwiredtiger.so+0x101a07)
           #11 __split_multi_lock /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/btree/bt_split.c:2173:13 (libwiredtiger.so+0xffca5)
           #12 __wt_split_multi /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/btree/bt_split.c:2191:5 (libwiredtiger.so+0xffca5)
           #13 __evict_page_dirty_update /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/evict/evict_page.c:445:13 (libwiredtiger.so+0x1c3696)
           #14 __wt_evict /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/evict/evict_page.c:226:9 (libwiredtiger.so+0x1c3696)
           #15 __evict_page /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/evict/evict_lru.c:2331:5 (libwiredtiger.so+0x1bfd1e)
           #16 __evict_lru_pages /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/evict/evict_lru.c:1144:20 (libwiredtiger.so+0x1bd3ca)
           #17 __wt_evict_thread_run /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/evict/evict_lru.c:320:9 (libwiredtiger.so+0x1b8fdb)
           #18 __thread_run /data/mci/efa7a03527f42318eaf9b7f0563fa22b/src/src/third_party/wiredtiger/src/support/thread_group.c:31:9 (libwiredtiger.so+0x2c2668)
           #19 __tsan_thread_start_func /data/mci/4c5523d6b930f0c1f82f5452d6add3b6/toolchain-builder/tmp/build-llvm-v4.sh-FAX/llvm-project-llvmorg/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:955:15 (mongod+0x80aac)
           #20 start_thread <null> (libpthread.so.0+0x82dd)
           #21 clone <null> (libc.so.6+0xfca62)
      

      We investigated this failure and concluded that it is not a bug in WiredTiger. It is expected WiredTiger behavior: in __split_ref_prepare, we need to lock the references to all child pages that are moved to a new page during a page split. The problem reproduces only in instrumented tests that intentionally generate load and cause excessive page splits. The failure is an assertion inside TSAN itself, not an actual deadlock.
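
      The failure mode described above can be sketched outside WiredTiger. The following is a minimal illustration, not WiredTiger code: lock_many is a hypothetical helper that acquires n distinct pthread mutexes on one thread without releasing any, which is the same pattern as locking every child reference during a split. It assumes a POSIX system.

```c
#include <pthread.h>
#include <stdlib.h>

/*
 * lock_many (hypothetical helper, not part of WiredTiger) --
 *     Hold n distinct mutexes at once on the calling thread, then release
 *     them. Returns n on success and -1 on any failure.
 */
int
lock_many(int n)
{
    pthread_mutex_t *locks;
    int i, inited = 0, held = 0, peak;

    if (n <= 0)
        return (-1);
    locks = malloc((size_t)n * sizeof(*locks));
    if (locks == NULL)
        return (-1);

    for (i = 0; i < n; i++) {
        if (pthread_mutex_init(&locks[i], NULL) != 0)
            break;
        inited++;
    }

    /* Acquire every mutex without releasing any of the earlier ones. */
    if (inited == n)
        for (i = 0; i < n; i++) {
            if (pthread_mutex_lock(&locks[i]) != 0)
                break;
            held++;
        }
    peak = held;

    /* Release in reverse order and clean up. */
    while (held > 0)
        (void)pthread_mutex_unlock(&locks[--held]);
    while (inited > 0)
        (void)pthread_mutex_destroy(&locks[--inited]);
    free(locks);

    return (peak == n ? peak : -1);
}
```

      In an ordinary build, lock_many(65) simply returns 65. In a binary built with -fsanitize=thread and the deadlock detector enabled, the 65th acquisition would be expected to trip the same CHECK shown in the trace (0x40 is 64), even though there is a single thread and no possible deadlock.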

      The issue is that this assertion cannot be suppressed with a TSAN suppression such as "deadlock:path/to/file" that targets specific code, as we already do for data races. The only way to avoid the failure is to disable the deadlock detector across the entire codebase, which is not what we want.
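
      For reference, the blunt workaround mentioned above might look like the sketch below. It relies on the __tsan_default_options hook that the TSAN runtime looks up at startup, and on the detect_deadlocks runtime flag. Where such a definition would live in the build is an assumption and is not something this ticket specifies.

```c
/*
 * Default TSAN runtime options compiled into the binary. The TSAN runtime
 * calls this hook at startup if the program defines it; without TSAN it is
 * just an unused function.
 *
 * detect_deadlocks=0 turns the deadlock detector off for the whole process,
 * not only for the WiredTiger split path.
 */
const char *
__tsan_default_options(void)
{
    return ("detect_deadlocks=0");
}
```

      Setting TSAN_OPTIONS=detect_deadlocks=0 in the environment should have the same effect for a single run. Either way, lock-order checking is lost everywhere, which is why the ticket treats this as undesirable.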

      I filed an issue with the sanitizer maintainers to track the problem. We can either continue to run less stressful tests without TSAN, or we can wait on the outcome of that ticket to see whether the maintainers are willing to support higher mutex depths.

            Assignee: Unassigned
            Reporter: Louis Williams (louis.williams@mongodb.com)