TSAN internal failure when running test/format in disagg mode


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines - Foundations

      TSAN aborts with an internal CHECK failure when test/format runs in disagg mode for long enough:

      ThreadSanitizer: CHECK failed: sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40) (tid=5536)
      

      TSAN supports at most 64 locks held at the same time (the 0x40 in the check above). This is a known limitation: there are several issues about it on the sanitizers GitHub, one of them filed by a MongoDB engineer about a similar case in WT.
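
      To check the limit in isolation from test/format, a minimal sketch along the following lines may help. It is not taken from WiredTiger: `hold_n_locks` and `MAX_LOCKS` are hypothetical names, and it assumes the limit applies to mutexes held by a single thread. Whether a given count actually triggers the CHECK depends on the TSAN version in use.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical upper bound for this sketch; not a WiredTiger or TSAN constant. */
#define MAX_LOCKS 128

static pthread_mutex_t locks[MAX_LOCKS];

/*
 * Lock n distinct mutexes on the calling thread without releasing any,
 * then unlock them in reverse order. Returns the peak number of mutexes
 * held at once, or 0 if n is out of range or initialization fails.
 */
size_t
hold_n_locks(size_t n)
{
    size_t held, i, inited, peak;

    if (n == 0 || n > MAX_LOCKS)
        return (0);

    /* Initialize the first n mutexes, stopping at the first failure. */
    for (inited = 0; inited < n; inited++)
        if (pthread_mutex_init(&locks[inited], NULL) != 0)
            break;

    /* Acquire every mutex and keep all of them held. */
    held = 0;
    if (inited == n)
        for (; held < n; held++)
            if (pthread_mutex_lock(&locks[held]) != 0)
                break;
    peak = held;

    /* Release in reverse acquisition order, then destroy what was created. */
    for (i = held; i > 0; i--)
        (void)pthread_mutex_unlock(&locks[i - 1]);
    for (i = inited; i > 0; i--)
        (void)pthread_mutex_destroy(&locks[i - 1]);

    return (peak);
}
```

      Calling `hold_n_locks` with counts just below and at 64 from a `main` built with `-fsanitize=thread` should show whether the toolchain's TSAN still has the 64-lock limit.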

      However, it is still worth checking whether holding 64 locks at once is reasonable and intentional in this case. The latest TSAN version raises this limit to 128, so if we really do need to hold 64 locks at the same time, another option is to upgrade the TSAN version in the MongoDB toolchain.

      I have never seen this issue in non-disagg test/format runs, so it would also be worth understanding what the difference is.

      We could also consider limiting the number of threads, which may let us increase the test duration and the number of operations.

      Solving this issue is important: it will unblock us from running test/format under TSAN for longer, which could help us catch more data races.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Ivan Kochin
