[SERVER-38697] Powercycle kill by PID might try to kill the wrong process Created: 18/Dec/18  Updated: 29/Oct/23  Resolved: 01/Mar/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.6.12, 4.0.7, 4.1.9

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: tig-powercycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: STM 2019-03-11
Participants:
Linked BF Score: 40

 Description   

It's possible that a PID for a process can be reused causing the wrong process to be killed by powercycle. This problem has been seen primarily on Windows.



 Comments   
Comment by Githook User [ 05/Mar/19 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-38697 Use process creation time to detect pid reuse.

Changes start_cmd() to use psutil.Popen() rather than subprocess.Popen()
in order to cache the creation time of the child process.

(cherry picked from commit d43369c671a596ee816f44038fca3423a0a33126)
Branch: v3.6
https://github.com/mongodb/mongo/commit/006233d0e02d6c902ba2ebd468f0314212d01bf4

Comment by Githook User [ 05/Mar/19 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-38697 Use process creation time to detect pid reuse.

Changes start_cmd() to use psutil.Popen() rather than subprocess.Popen()
in order to cache the creation time of the child process.

(cherry picked from commit d43369c671a596ee816f44038fca3423a0a33126)
Branch: v4.0
https://github.com/mongodb/mongo/commit/f80f05151377eed5119a8c9cb52667d50f668610

Comment by Githook User [ 01/Mar/19 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-38697 Use process creation time to detect pid reuse.

Changes start_cmd() to use psutil.Popen() rather than subprocess.Popen()
in order to cache the creation time of the child process.
Branch: master
https://github.com/mongodb/mongo/commit/d43369c671a596ee816f44038fca3423a0a33126

Comment by Max Hirschhorn [ 28/Feb/19 ]

Jonathan had proposed using the process creation time to detect if the pid is being reused. psutil would actually do that for us automatically but because we're storing a subprocess.Popen instance in Processes._PROC_LIST rather than a psutil.Popen instance (which derives from psutil.Process) so we aren't getting any benefit. The kill_process() function could then just take the psutil.Process instance directly instead of the pid.

Note the way this class is bound to a process is uniquely via its PID. That means that if the process terminates and the OS reuses its PID you may end up interacting with another process. The only exceptions for which process identity is preemptively checked (via PID + creation time) is for the following methods: nice() (set), ionice() (set), cpu_affinity() (set), rlimit() (set), children(), parent(), parents(), suspend() resume(), send_signal(), terminate() kill(). To prevent this problem for all other methods you can use is_running() before querying the process or process_iter() in case you’re iterating over all processes. It must be noted though that unless you deal with very “old” (inactive) Process instances this will hardly represent a problem.

https://psutil.readthedocs.io/en/latest/#psutil.Process

That all being said, job objects are the reliable way to do process management on Windows. Unfortunately, nested job objects are only supported starting in Windows Server 2012, and that's newer than the version of Windows we run our tests with on older branches. This means we'd risk leaking processes across Evergreen tasks if we attempted to have the created process break away from the job object created by the Evergreen agent. It'd be nice to one day use the TerminateJobObject() function to terminate all children processes between powercycle loops. The current approach of scanning over the list of all processes has checking its ppid has gaps if the parent process has already died, or is prone to races if a process is in the midst of spawning a child process. (An environment variable is the way the Evergreen agent approaches this issue, which in turn was taken from how Jenkins does it.)

http://nikhilism.com/post/2017/windows-job-objects-process-tree-management/ and https://lwn.net/Articles/754980/ are some other references I found interesting to read through when thinking about this subject.

Generated at Thu Feb 08 04:49:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.