ASAN hides memory leaks in Python tests when executed in parallel


    • Type: Build Failure
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Test Python
    • Storage Engines - Foundations

      It's not clear why this is happening, but it can be reproduced by running these commands:

      export PATH=/opt/mongodbtoolchain/v5/bin:$PATH
      cmake -DCMAKE_BUILD_TYPE=ASan -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchains/clang.cmake ..
      cmake --build . -j 16
      
      export COMMON_SAN_OPTIONS="abort_on_error=1:disable_coredump=0"
      export ASAN_OPTIONS="$COMMON_SAN_OPTIONS:unmap_shadow_on_exit=1"
      export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(dirname $(/opt/mongodbtoolchain/v5/bin/clang -print-file-name=libclang_rt.asan.so))"
      export TESTUTIL_BYPASS_ASAN=1
      export LSAN_OPTIONS="$COMMON_SAN_OPTIONS:print_suppressions=0:suppressions=$(git rev-parse --show-toplevel)/test/evergreen/asan_leaks.supp"
      
      # Passes successfully
      python3 ../test/suite/run.py test_verify --asan -j 2 
      
      # Reports the leak warning
      python3 ../test/suite/run.py test_verify --asan
      

      The warning that is shown in this case is:

      ==9672==ERROR: LeakSanitizer: detected memory leaks
      
      Direct leak of 53248 byte(s) in 2 object(s) allocated from:
          #0 0x7f1aad305cdc in realloc /data/mci/35d44bafaf07d0d34e58f8888b61215a/toolchain-builder/tmp/build-llvm-v5.sh-dB2/llvm-project-llvmorg/compiler-rt/lib/asan/asan_malloc_linux.cpp:82:3
          #1 0x7f1aa89ae0f1 in __realloc_func /home/ubuntu/work/git/wiredtiger/src/os_common/os_alloc.c:160:18
          #2 0x7f1aa89ae31e in __wt_realloc_noclear /home/ubuntu/work/git/wiredtiger/src/os_common/os_alloc.c:198:13
          #3 0x7f1aa8c9b910 in __wt_buf_grow_worker /home/ubuntu/work/git/wiredtiger/src/support/scratch.c:52:9
          #4 0x7f1aa820e01d in __wt_buf_grow /home/ubuntu/work/git/wiredtiger/src/include/buf_inline.h:24:9
          #5 0x7f1aa820d097 in __wt_buf_init /home/ubuntu/work/git/wiredtiger/src/include/buf_inline.h:57:13
          #6 0x7f1aa820a228 in __wti_block_read_off /home/ubuntu/work/git/wiredtiger/src/block/block_read.c:198:5
          #7 0x7f1aa820971a in __wt_bm_read /home/ubuntu/work/git/wiredtiger/src/block/block_read.c:47:5
          #8 0x7f1aa825666c in __bm_read /home/ubuntu/work/git/wiredtiger/src/block_cache/block_mgr.c:557:13
          #9 0x7f1aa8241c75 in __wt_blkcache_read /home/ubuntu/work/git/wiredtiger/src/block_cache/block_io.c:191:13
          #10 0x7f1aa82475c8 in __wt_blkcache_read_multi /home/ubuntu/work/git/wiredtiger/src/block_cache/block_io.c:442:9
          #11 0x7f1aa84068f0 in __page_read /home/ubuntu/work/git/wiredtiger/src/btree/bt_read.c:282:5
          #12 0x7f1aa840349f in __wt_page_in_func /home/ubuntu/work/git/wiredtiger/src/btree/bt_read.c:529:13
          #13 0x7f1aa84dab99 in __verify_tree /home/ubuntu/work/git/wiredtiger/src/btree/bt_vrfy.c:866:19
          #14 0x7f1aa84daeca in __verify_tree /home/ubuntu/work/git/wiredtiger/src/btree/bt_vrfy.c:884:19
          #15 0x7f1aa84daeca in __verify_tree /home/ubuntu/work/git/wiredtiger/src/btree/bt_vrfy.c:884:19
          #16 0x7f1aa84d2fd8 in __wt_verify /home/ubuntu/work/git/wiredtiger/src/btree/bt_vrfy.c:313:13
          #17 0x7f1aa8b88845 in __wti_execute_handle_operation /home/ubuntu/work/git/wiredtiger/src/schema/schema_worker.c:32:5
          #18 0x7f1aa8b891d0 in __wt_schema_worker /home/ubuntu/work/git/wiredtiger/src/schema/schema_worker.c:191:13
          #19 0x7f1aa8b89953 in __wt_schema_worker /home/ubuntu/work/git/wiredtiger/src/schema/schema_worker.c:225:17
          #20 0x7f1aa8bbb1ed in __session_verify /home/ubuntu/work/git/wiredtiger/src/session/session_api.c:1817:5
          #21 0x7f1aac35b239 in _wrap_Session_verify /home/ubuntu/work/git/wiredtiger/build/lang/python/CMakeFiles/wiredtiger_python.dir/wiredtigerPYTHON_wrap.c:7597:21
          #22 0x7f1aacf3cee7 in cfunction_call /data/mci/ffd7bdd8113a1675e4d94dc66e419fee/toolchain-builder/tmp/build-python-v4.sh-VJS/build-Python-3.10.4/../src/Python-3.10.4/Objects/methodobject.c:552:1
      

      This particular leak occurs when the function fails with a “potential hardware corruption” error (simulated by the test), which aborts the process. Because the memory is not freed in that case, it is technically a memory leak. This may not be the best example, but many issues are reported across different tests, and some of them are real.

      For example, this one was detected by running Python tests without parallel execution: https://github.com/wiredtiger/wiredtiger/pull/12753 . It wasn’t detected by CI because Python tests with ASAN are run in parallel there.

      I ran a few experiments to understand exactly how to fix the issue. I even tried rewriting the parallel execution logic to catch the warnings (https://github.com/wiredtiger/wiredtiger/tree/try-remove-concurrencytest), but that didn’t help. My current suspicion is that this is simply a limitation of ASAN’s design: it doesn’t behave well when tests are executed in parallel processes.

      I also tested other sanitizers, and they seem to behave normally. I’m not 100% sure, though: the proper check would be to intentionally introduce errors in different parts of the code and verify that CI consistently reports them.
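
One way to make that check mechanical is to capture the output of a serial run and a parallel run and compare the sanitizer report headers each one contains. The following is a minimal sketch, not part of the WiredTiger test suite: the function names are hypothetical, and it assumes only that reports begin with a header line of the form shown above (`==9672==ERROR: LeakSanitizer: detected memory leaks`).

```python
import re

# Header line that a sanitizer runtime prints at the start of a report, e.g.
# "==9672==ERROR: LeakSanitizer: detected memory leaks".
# Leading spaces or tabs are tolerated in case the log is indented.
SANITIZER_REPORT = re.compile(
    r"^[ \t]*==\d+==ERROR: (\w+Sanitizer): (.*)$", re.MULTILINE
)


def find_sanitizer_reports(output):
    """Return (sanitizer, summary) pairs for every report header in output."""
    return [
        (m.group(1), m.group(2).strip())
        for m in SANITIZER_REPORT.finditer(output)
    ]


def missing_reports(serial_output, parallel_output):
    """Return reports present in the serial run but absent from the parallel run."""
    serial = set(find_sanitizer_reports(serial_output))
    parallel = set(find_sanitizer_reports(parallel_output))
    return sorted(serial - parallel)
```

Feeding it the captured output of the two `run.py` invocations above would list any report that only the serial run produced. The process ID is deliberately ignored so that the same leak compares equal across runs.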

      I’m also not certain that the other test types detect all warnings. As far as I can see, format.sh always creates a background job to run tests, which could also be problematic.
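
The concern with background jobs can be illustrated in Python. With `abort_on_error=1`, as set in the options above, a sanitizer failure should end the process through `abort()`, so the launcher only notices if it waits for the job and inspects how it ended. This is a sketch under stated assumptions, not format.sh's actual logic: the helper names are hypothetical, and a child that calls `os.abort()` stands in for a test binary aborted by a sanitizer.

```python
import signal
import subprocess
import sys


def run_in_background(argv):
    """Start argv without waiting for it, similar to a shell background job."""
    return subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )


def collect(proc):
    """Wait for the job and describe how it ended."""
    code = proc.wait()
    if code < 0:
        # On POSIX, a negative return code -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-code).name
    return "exited with status %d" % code


# Stand-in for a sanitizer abort: os.abort() raises SIGABRT in the child.
job = run_in_background([sys.executable, "-c", "import os; os.abort()"])
result = collect(job)
```

If the launcher never collects the job's status, or only checks for a non-zero exit code in a place where the status has already been discarded, an abort like this one passes unnoticed.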

      I created WT-16067 to track further progress.

            Assignee:
            Ivan Kochin
            Reporter:
            Ivan Kochin