WiredTiger / WT-7617

Improve diagnosability of Python test hangs in Evergreen


    Details

    • Type: Task
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: None
    • Labels:

      Description

      When a Python test hangs in Evergreen, the test log files give us very little useful information. The end of the log generally looks something like this:

      [2021/05/25 03:09:33.066] test_compat03.test_compat03.test_compat03(26_patch_rel.future_max.33_min) (subunit.RemotedTestCase)
      [2021/05/25 03:09:33.066] test_compat03.test_compat03.test_compat03(26_patch_rel.future_max.33_min) ... ok
      [2021/05/25 05:09:34.049] Command stopped early: context canceled
      [2021/05/25 05:09:34.049] Running task-timeout commands.

      Nothing in the log states which test case is hanging: the last test reported finishes with "ok", and the next entry is the task timeout two hours later. Because multiple tests generally run in parallel, it is difficult to determine exactly which one hung.

      There also seems to be a second issue: our hang analyzer script isn't finding debug symbols, so it doesn't show useful stack traces. For example:

      [2021/05/25 05:09:36.223] 0x00007f5c52669f10 0x00007f5c5268a550 Yes /lib64/ld-linux-x86-64.so.2
      [2021/05/25 05:09:36.223] : Shared library is missing debugging information.
      [2021/05/25 05:09:36.223] Id Target Id Frame
      [2021/05/25 05:09:36.223] * 1 LWP 30268 "python3" 0x00007f5c516a37c6 in ?? ()
      [2021/05/25 05:09:36.224] 2 LWP 30282 "python3" 0x00007f5c52185184 in ?? ()
      [2021/05/25 05:09:36.224] Thread 2 (LWP 30282):
      [2021/05/25 05:09:36.224] #0 0x00007f5c52185184 in ?? ()
      [2021/05/25 05:09:36.224] #1 0x0000000000001000 in ?? ()
      [2021/05/25 05:09:36.224] #2 0x00007f5c4b6f77c0 in ?? ()
      [2021/05/25 05:09:36.224] #3 0x00007f5c33fff680 in ?? ()
      [2021/05/25 05:09:36.224] #4 0x0000000000001000 in ?? ()
      [2021/05/25 05:09:36.224] #5 0x00007f5c4b6f77c0 in ?? ()

      We should enhance our testing to make such failures easier to diagnose. An example failure can be seen here.

              People

              Assignee:
              backlog-server-storage-engines Backlog - Storage Engines Team
              Reporter:
              alexander.gorrod Alexander Gorrod
              Votes:
              0
              Watchers:
              3
