[SERVER-81926] resmoke Process::stop function should NOT wait for process to exit in windows Created: 06/Oct/23  Updated: 11/Jan/24

Status: Backlog
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Yujin Kang Park Assignee: [DO NOT ASSIGN] Backlog - DevProd Correctness
Resolution: Unresolved Votes: 0
Labels: FY2025Q1, resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-25358 resmoke does not terminate mongod cle... Closed
Assigned Teams:
Correctness
Operating System: ALL
Participants:
Linked BF Score: 11

 Description   

Other parts of resmoke (such as ContinuousStepdown) assume that stopping a process does not wait for it to exit cleanly. In ContinuousStepdown's case, the goal is to step up a secondary immediately after the old primary has been shut down, so that availability is restored as fast as possible.

This is already the case for the non-Windows implementation, but for mongod in TERMINATE mode on Windows, resmoke waits up to 60 seconds for a reason that is not yet known. This causes sporadic test failures because a primary cannot be found.



 Comments   
Comment by Max Hirschhorn [ 07/Oct/23 ]

"Modifying the windows logic to just set the event without waiting for it to complete may make it match the linux behavior better."

It is important for the mongod process to shut down cleanly on Windows so that another mongod process can be brought up successfully with the same --dbpath and contents. I would recommend moving the wait out of resmokelib.core.process.Process.stop() and into resmokelib.core.process.Process.wait(). This may be what you intended for matching the behavior on Linux systems, but I wanted to clarify it all the same.
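The split suggested above could look roughly like the following. This is only a sketch under stated assumptions, not the actual resmokelib code: `SketchProcess` is a hypothetical wrapper, and closing the child's stdin stands in for the real shutdown request (SIGTERM on Linux, the shutdown event on Windows) so that the example runs on any platform.

```python
import subprocess
import sys

# Hypothetical stand-in for mongod: blocks until its stdin is closed (the
# "shutdown request"), spends a moment on cleanup, then exits with status 0.
CHILD = r"""
import sys, time
sys.stdin.read()   # returns once the parent closes the pipe
time.sleep(0.3)    # pretend to flush data files
sys.exit(0)
"""


class SketchProcess:
    """Illustrative wrapper; NOT resmokelib.core.process.Process."""

    def __init__(self):
        self._popen = subprocess.Popen(
            [sys.executable, "-c", CHILD], stdin=subprocess.PIPE
        )

    def stop(self):
        """Request shutdown and return immediately, without waiting for exit."""
        self._popen.stdin.close()

    def poll(self):
        """Return the exit code, or None if the process is still running."""
        return self._popen.poll()

    def wait(self, timeout=60):
        """Block until the process exits; kill it if the timeout expires."""
        try:
            return self._popen.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self._popen.kill()
            return self._popen.wait()
```

With this shape, a caller such as a stepdown hook can call `stop()` and move on right away, while any caller that needs to reuse the same --dbpath calls `wait()` first and still gets the clean-shutdown guarantee.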

Comment by Zack Winter [ 06/Oct/23 ]

It looks like the Windows-specific logic here handles the difference in behavior between Linux and Windows when calling `process.terminate` in Python: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.terminate

On Linux it sends a SIGTERM, which gives the process a chance to handle it: https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGTERM

On Windows it calls the TerminateProcess API, which doesn't give the process any way to handle it: https://learn.microsoft.com/en-us/windows/win32/procthread/terminating-a-process

The Windows-specific logic here sets an event that mongod is listening for. Once mongod handles it, it'll call https://github.com/10gen/mongo/blob/SERVER-81852/src/mongo/util/signal_handlers.cpp#L158 just as it does when receiving a SIGTERM on Linux. Afterwards, it'll reset the event and resmoke will continue execution.

I believe the difference is that `process.terminate` on Linux fires and forgets SIGTERM, while on Windows we wait for the event to be received. Modifying the Windows logic to just set the event without waiting for it to complete may make it match the Linux behavior better.
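To illustrate the two behaviors being compared, here is a platform-neutral sketch. It assumes a `threading.Event` as a stand-in for the Windows event and a thread as a stand-in for mongod; `FakeServer`, `stop_and_wait`, and `stop_fire_and_forget` are hypothetical names, and a separate `exited` flag is used instead of the event reset described above to keep the example short.

```python
import threading
import time


class FakeServer:
    """Stand-in for mongod: waits for a shutdown event, cleans up, then exits."""

    def __init__(self, cleanup_secs=0.3):
        self.shutdown_requested = threading.Event()  # stand-in for the Windows event
        self.exited = threading.Event()              # set once "cleanup" is done
        self._cleanup_secs = cleanup_secs
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        self.shutdown_requested.wait()
        time.sleep(self._cleanup_secs)  # pretend to shut down cleanly
        self.exited.set()


def stop_and_wait(server, timeout=60.0):
    """Behavior described in this ticket: set the event, then block until done."""
    server.shutdown_requested.set()
    return server.exited.wait(timeout)  # True if the server exited in time


def stop_fire_and_forget(server):
    """Proposed behavior: set the event and return immediately."""
    server.shutdown_requested.set()
```

In the fire-and-forget variant the caller regains control while the server is still cleaning up, which is what a SIGTERM-based stop does on Linux. The caller then has to wait explicitly before relying on the server having exited.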

max.hirschhorn@mongodb.com does this seem reasonable to you?
