Hang analyzer for WT Evergreen tests


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: WT10.0.0, 4.9.0, 4.4.3
    • Affects Version/s: None
    • Component/s: None
    • Sprint: Storage - Ra 2020-11-16
    • Story Points: 3

      Triaging and diagnosing hang failures in automated Evergreen tests should be easier.  After determining that a test is hung, Evergreen should automatically collect and report data that will help with the initial triage and diagnosis of the problem.  Ideally we might collect:

      • Which test programs were running at the time of the hang.
      • The WiredTiger directory for those tests (I believe we already keep this for all tests)
      • Cores of the hung process(es), to help engineers determine why they were hung
      • Stack traces from the hung processes, to include in the Evergreen logs to facilitate triage.

      There is probably other stuff that would be useful as well.
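
      As a rough illustration of the collection steps listed above, here is a minimal Python sketch. It is not the resmoke hang analyzer and not an existing WiredTiger script. It assumes gdb is installed on the Evergreen host and that the hung processes can be found by command name. The names in TEST_PROGRAMS, the hang_artifacts output directory, and the helper function names are all hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical test program names; substitute whatever the Evergreen task runs.
TEST_PROGRAMS = ("wt", "test_example")


def parse_ps(ps_output, names):
    """Return (pid, command) pairs from `ps -eo pid=,comm=` output
    whose command name is in `names`."""
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, comm = int(parts[0]), parts[1].strip()
        if comm in names:
            matches.append((pid, comm))
    return matches


def gdb_command(pid, out_dir):
    """Build a gdb batch invocation that prints every thread's stack
    and then writes a core file for the attached process."""
    core_path = Path(out_dir) / f"core.{pid}"
    return [
        "gdb", "--batch", "-p", str(pid),
        "-ex", "thread apply all bt",
        "-ex", f"gcore {core_path}",
    ]


def analyze_hang(names=TEST_PROGRAMS, out_dir="hang_artifacts"):
    """Report which test programs are running, then dump stacks and cores."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    ps = subprocess.run(
        ["ps", "-eo", "pid=,comm="],
        capture_output=True, text=True, check=True,
    )
    for pid, comm in parse_ps(ps.stdout, names):
        # Printed output ends up in the task log, which helps with triage.
        print(f"=== {comm} (pid {pid}) ===")
        if shutil.which("gdb") is None:
            print("gdb not found; skipping stack trace and core")
            continue
        try:
            result = subprocess.run(
                gdb_command(pid, out_dir),
                capture_output=True, text=True, timeout=300,
            )
            print(result.stdout)
        except subprocess.TimeoutExpired:
            print(f"gdb timed out on pid {pid}")
```

      Calling analyze_hang() from a task's timeout handler would print the stack traces to the log and leave the core files in hang_artifacts for upload. Two caveats: Linux truncates the comm field to 15 characters, so longer program names would need a different match, and the sketch does not copy the WiredTiger directory.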

      MongoDB's resmoke.py includes a hang analyzer used for this purpose: buildscripts/resmokelib/commands/hang_analyzer.py.  We might be able to use it as the basis for a WT hang analyzer, or simply adopt it outright.

              Assignee:
              Ravi Giri
              Reporter:
              Keith Smith
              Votes:
              0
              Watchers:
              5

                Created:
                Updated:
                Resolved: