[SERVER-46687] Run hang-analyzer from resmoke and integrate with archival Created: 06/Mar/20  Updated: 29/Oct/23  Resolved: 13/Aug/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Vlad Rachev (Inactive) Assignee: Vlad Rachev (Inactive)
Resolution: Fixed Votes: 0
Labels: tig-hanganalyzer, tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-48705 resmoke.py sending SIGABRT to take co... Closed
is depended on by SERVER-48895 Complete TODO listed in SERVER-46691 Closed
is depended on by SERVER-46691 Rework the timeout task in evergreen.... Closed
Duplicate
duplicates SERVER-37154 hang_analyzer should not run against ... Closed
is duplicated by SERVER-46820 Kill hung processes as the last step ... Closed
is duplicated by SERVER-48728 Complete TODO listed in SERVER-46691 Closed
is duplicated by SERVER-46691 Rework the timeout task in evergreen.... Closed
Backwards Compatibility: Fully Compatible
Sprint: STM 2020-06-01, STM 2020-06-15, STM 2020-06-29, STM 2020-07-27, STM 2020-08-10, STM 2020-08-24
Participants:
Story Points: 7

 Description   

NOTE: This should be behind a flag. SERVER-46691 will remove it.

When a test or task times out in Evergreen, resmoke will be sent a SIGUSR1 signal. The signal handler in resmoke will be modified to call the hang-analyzer on all tests that are still running. In the case of a test timeout there should be only one such test, but in the case of a task timeout there can be several, one per running job.

Some complexity exists in determining the test pids. For tests started by resmoke fixtures, we can grab the fixture pids themselves. For tests that use mongorunner to start processes, call psutil's Process.children() on the mongo shell process to get the list.
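A minimal sketch of the handler described above, under stated assumptions: the registries FIXTURE_PIDS and SHELL_PIDS and the functions collect_test_pids, analyze and install_handler are hypothetical names for illustration, not resmoke's actual code. The analyze placeholder only prints the pids rather than invoking the real hang-analyzer. Child discovery uses psutil when it is installed and is skipped otherwise. SIGUSR1 is POSIX-only, so this does not cover Windows.

```python
import os
import signal

try:
    import psutil  # third-party; used only to discover child processes
except ImportError:
    psutil = None

# Hypothetical registry: pids of processes started directly by fixtures.
FIXTURE_PIDS = set()
# Hypothetical registry: pids of mongo shell processes that run tests.
SHELL_PIDS = set()


def collect_test_pids():
    """Return fixture pids plus each shell process and its descendants."""
    pids = set(FIXTURE_PIDS)
    for shell_pid in SHELL_PIDS:
        pids.add(shell_pid)
        if psutil is None:
            continue  # without psutil, only the shell pid itself is known
        try:
            children = psutil.Process(shell_pid).children(recursive=True)
        except psutil.NoSuchProcess:
            continue  # the shell exited between registration and the signal
        pids.update(child.pid for child in children)
    return pids


def analyze(pids):
    """Placeholder for invoking the hang-analyzer; it only reports the pids."""
    print("would analyze pids:", sorted(pids))


def install_handler(analyzer=analyze):
    """Install a SIGUSR1 handler that passes the collected pids to analyzer."""

    def _on_sigusr1(signum, frame):
        analyzer(collect_test_pids())

    signal.signal(signal.SIGUSR1, _on_sigusr1)
```

Passing the analyzer as a parameter keeps the handler testable: a test can install a stub, send SIGUSR1 to its own process with os.kill(os.getpid(), signal.SIGUSR1), and check which pids were reported.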



 Comments   
Comment by Githook User [ 13/Aug/20 ]

Author:

{'name': 'vrachev', 'email': 'vlad.rachev@mongodb.com', 'username': 'vrachev'}

Message: SERVER-46687 Run hang-analyzer from resmoke and integrate with archival
Branch: master
https://github.com/mongodb/mongo/commit/5266a96260f70fc4d4d561d505f62bba4c2ff76b

Comment by Githook User [ 29/Jul/20 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@mongodb.com'}

Message: Revert "SERVER-46687 Run hang-analyzer from resmoke and integrate with archival"

This reverts commit 0cbc4ea6a9865906736bae49be34e4359dd3853e.
Branch: master
https://github.com/mongodb/mongo/commit/9ef7c7a0fb72b8ee7cca3218685752d21072c1a2

Comment by Githook User [ 28/Jul/20 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@mongodb.com'}

Message: SERVER-46687 Run hang-analyzer from resmoke and integrate with archival

This reverts commit 6ac765dd18a96bbe43eb22a30ddaf3a4d42ae2e3.
Branch: master
https://github.com/mongodb/mongo/commit/0cbc4ea6a9865906736bae49be34e4359dd3853e

Comment by Githook User [ 28/Jul/20 ]

Author:

{'name': 'Henrik Edin', 'email': 'henrik.edin@mongodb.com', 'username': 'henrikedin'}

Message: Revert "SERVER-46687 Run hang-analyzer from resmoke and integrate with archival"

This reverts commit aa5754b408d61bd941ef20f7cd4525e6768cff6d.
Branch: master
https://github.com/mongodb/mongo/commit/6ac765dd18a96bbe43eb22a30ddaf3a4d42ae2e3

Comment by Githook User [ 27/Jul/20 ]

Author:

{'name': 'vrachev', 'email': 'vlad.rachev@mongodb.com', 'username': 'vrachev'}

Message: SERVER-46687 Run hang-analyzer from resmoke and integrate with archival
Branch: master
https://github.com/mongodb/mongo/commit/aa5754b408d61bd941ef20f7cd4525e6768cff6d

Comment by Robert Guo (Inactive) [ 23/Jul/20 ]

Code is complete. Assigning to myself for now to look at failures in the commit queue since Vlad's on vacation.

Comment by Vlad Rachev (Inactive) [ 23/Jun/20 ]

Putting this on hold for now while focusing on SERVER-48690 and SERVER-48961 for the release.

Comment by Vlad Rachev (Inactive) [ 11/Jun/20 ]

Closed SERVER-46820 and SERVER-46691 as dupes of this as the work for all 3 makes sense to be done in one ticket, and we can't use the functionality of any of the tickets without all 3 being done.

The functionality will be:

  • When `resmoke.py run` is called, resmoke will write its own pid to a file in ./build.
  • When a timeout occurs, the Evergreen timeout task will call `resmoke run-timeout` (no args needed). This script will look at the file above to check for the pid of the resmoke process, and it will signal the python thread (using the hang-analyzer). It will then wait for the process to exit. This is to prevent Evergreen from killing processes before the hang-analyzer has finished.
  • The resmoke signal handler will call the hang-analyzer on all of the processes it has started. The list of pids will be tracked by: 1) adding the pid of processes started in process.py to config.py, and 2) using psutil's Process.children() to get processes started by mongorunner.
  • The signal handler will then shut down all pids, and archival can proceed if specified.
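
The pid-file handshake in the first two bullets could look roughly like the sketch below. The file name build/resmoke.pid and the helper names are assumptions for illustration; the ticket only says the file lives in ./build. The existence check uses os.kill(pid, 0), which is POSIX behaviour, so this sketch does not cover Windows.

```python
import os
import time
from pathlib import Path

# Hypothetical location; the ticket only specifies "a file in ./build".
PID_FILE = Path("build") / "resmoke.pid"


def write_pid_file(path=PID_FILE):
    """Record the current process's pid so a later command can find it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(str(os.getpid()))
    return path


def read_pid_file(path=PID_FILE):
    """Return the pid stored by write_pid_file."""
    return int(path.read_text().strip())


def is_running(pid):
    """Check whether a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the check without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout=60.0, poll_interval=0.5):
    """Poll until the process exits; return False if the timeout elapses first."""
    deadline = time.monotonic() + timeout
    while is_running(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

In this sketch the run side would call write_pid_file() at startup. The timeout side would call read_pid_file(), signal that pid, and then block in wait_for_exit() so that nothing is killed while the hang-analyzer is still working.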
Generated at Thu Feb 08 05:12:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.