[SERVER-38697] Powercycle kill by PID might try to kill the wrong process Created: 18/Dec/18 Updated: 29/Oct/23 Resolved: 01/Mar/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.12, 4.0.7, 4.1.9 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jonathan Abrahams | Assignee: | Max Hirschhorn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | tig-powercycle | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: |
v4.0, v3.6
|
||||||||||||
| Sprint: | STM 2019-03-11 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 40 | ||||||||||||
| Description |
|
It's possible that a PID for a process can be reused causing the wrong process to be killed by powercycle. This problem has been seen primarily on Windows. |
| Comments |
| Comment by Githook User [ 05/Mar/19 ] |
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}Message: Changes start_cmd() to use psutil.Popen() rather than subprocess.Popen() (cherry picked from commit d43369c671a596ee816f44038fca3423a0a33126) |
| Comment by Githook User [ 05/Mar/19 ] |
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}Message: Changes start_cmd() to use psutil.Popen() rather than subprocess.Popen() (cherry picked from commit d43369c671a596ee816f44038fca3423a0a33126) |
| Comment by Githook User [ 01/Mar/19 ] |
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}Message: Changes start_cmd() to use psutil.Popen() rather than subprocess.Popen() |
| Comment by Max Hirschhorn [ 28/Feb/19 ] |
|
Jonathan had proposed using the process creation time to detect if the pid is being reused. psutil would actually do that for us automatically but because we're storing a subprocess.Popen instance in Processes._PROC_LIST rather than a psutil.Popen instance (which derives from psutil.Process) so we aren't getting any benefit. The kill_process() function could then just take the psutil.Process instance directly instead of the pid.
That all being said, job objects are the reliable way to do process management on Windows. Unfortunately, nested job objects are only supported starting in Windows Server 2012, and that's newer than the version of Windows we run our tests with on older branches. This means we'd risk leaking processes across Evergreen tasks if we attempted to have the created process break away from the job object created by the Evergreen agent. It'd be nice to one day use the TerminateJobObject() function to terminate all children processes between powercycle loops. The current approach of scanning over the list of all processes has checking its ppid has gaps if the parent process has already died, or is prone to races if a process is in the midst of spawning a child process. (An environment variable is the way the Evergreen agent approaches this issue, which in turn was taken from how Jenkins does it.) http://nikhilism.com/post/2017/windows-job-objects-process-tree-management/ and https://lwn.net/Articles/754980/ are some other references I found interesting to read through when thinking about this subject. |