WiredTiger / WT-7617

Improve diagnosability of Python test hangs in Evergreen

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:

      When a Python test hangs in Evergreen, the test log files give us little useful information. The output generally looks something like:

      [2021/05/25 03:09:33.066] test_compat03.test_compat03.test_compat03(26_patch_rel.future_max.33_min) (subunit.RemotedTestCase)
      [2021/05/25 03:09:33.066] test_compat03.test_compat03.test_compat03(26_patch_rel.future_max.33_min) ... ok
      [2021/05/25 05:09:34.049] Command stopped early: context canceled
      [2021/05/25 05:09:34.049] Running task-timeout commands.

      Nothing in the log states definitively which test case is hanging. Because multiple tests generally run in parallel, it's difficult to determine exactly which one hung.
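
      One way to get a definitive statement would be for each test to log one marker when it starts and another when it finishes. Any test with a start marker but no finish marker at timeout is then a hang candidate. The following is a minimal sketch, not existing tooling: the `STARTING <test id>` and `FINISHED <test id>` markers and the `unfinished_tests` helper are hypothetical, and the current test runner does not emit such lines.

```python
import re

# Hypothetical markers; the existing test runner does not print these.
_MARKER = re.compile(r"\b(?P<event>STARTING|FINISHED)\s+(?P<test>\S+)")


def unfinished_tests(log_lines):
    """Return ids of tests that logged STARTING but never FINISHED, in start order."""
    pending = {}  # dicts keep insertion order, so this doubles as an ordered set
    for line in log_lines:
        match = _MARKER.search(line)
        if match is None:
            continue  # unrelated log output
        if match.group("event") == "STARTING":
            pending[match.group("test")] = None
        else:
            pending.pop(match.group("test"), None)
    return list(pending)
```

      Because the markers are matched per test id, interleaved output from tests running in parallel would not confuse the scan.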

      There also seems to be an issue with our hang analyzer script: it isn't finding debug symbols, so it doesn't show useful stack traces. For example:

      [2021/05/25 05:09:36.223] 0x00007f5c52669f10 0x00007f5c5268a550 Yes /lib64/ld-linux-x86-64.so.2
      [2021/05/25 05:09:36.223] : Shared library is missing debugging information.
      [2021/05/25 05:09:36.223] Id Target Id Frame
      [2021/05/25 05:09:36.223] * 1 LWP 30268 "python3" 0x00007f5c516a37c6 in ?? ()
      [2021/05/25 05:09:36.224] 2 LWP 30282 "python3" 0x00007f5c52185184 in ?? ()
      [2021/05/25 05:09:36.224] Thread 2 (LWP 30282):
      [2021/05/25 05:09:36.224] #0 0x00007f5c52185184 in ?? ()
      [2021/05/25 05:09:36.224] #1 0x0000000000001000 in ?? ()
      [2021/05/25 05:09:36.224] #2 0x00007f5c4b6f77c0 in ?? ()
      [2021/05/25 05:09:36.224] #3 0x00007f5c33fff680 in ?? ()
      [2021/05/25 05:09:36.224] #4 0x0000000000001000 in ?? ()
      [2021/05/25 05:09:36.224] #5 0x00007f5c4b6f77c0 in ?? ()

      We should enhance our testing to make such failures easier to diagnose. An example failure can be seen here.

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            alexander.gorrod@mongodb.com Alexander Gorrod
            Votes:
            0
            Watchers:
            3

              Created:
              Updated: