[SERVER-81926] resmoke Process::stop function should NOT wait for process to exit in windows Created: 06/Oct/23 Updated: 11/Jan/24 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yujin Kang Park | Assignee: | [DO NOT ASSIGN] Backlog - DevProd Correctness |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | FY2025Q1, resmoke |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Correctness |
| Operating System: | ALL |
| Participants: | |
| Linked BF Score: | 11 |
| Description |
|
Other parts of resmoke (such as ContinuousStepdown) assume that stopping a process does not wait for it to exit cleanly. In ContinuousStepdown, the goal is to step up a secondary immediately after the old primary has been shut down, so that availability is restored as quickly as possible. The non-Windows implementation already behaves this way, but on Windows, for reasons not yet known, resmoke waits up to 60 seconds when stopping mongod in TERMINATE mode. This causes sporadic test failures because a primary cannot be found. |
| Comments |
| Comment by Max Hirschhorn [ 07/Oct/23 ] |
It is important for the mongod process to shut down cleanly on Windows so that another mongod process can be brought up successfully with the same --dbpath and contents. I would recommend moving the wait out of resmokelib.core.process.Process.stop() and into resmokelib.core.process.Process.wait() instead. This may be what you intended in order to match the behavior on Linux systems, but I wanted to clarify it all the same. |
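A minimal sketch of the split suggested above, where stop() only requests shutdown and wait() does the blocking. The class `SketchProcess` and the function `demo` are hypothetical names, not resmoke code, and `subprocess.Popen.terminate()` stands in for whatever signalling resmoke actually performs (on Windows the real logic sets an event instead).

```python
import subprocess
import sys
import time


class SketchProcess:
    """Hypothetical stand-in for resmokelib.core.process.Process."""

    def __init__(self, args):
        self._proc = subprocess.Popen(args)

    def stop(self):
        # Request shutdown and return immediately, without waiting for exit.
        # Assumption: terminate() replaces resmoke's real signalling here.
        # On POSIX it sends SIGTERM; on Windows it calls TerminateProcess.
        self._proc.terminate()

    def wait(self, timeout=None):
        # All blocking happens here, so callers choose when to pay for it.
        return self._proc.wait(timeout=timeout)


def demo():
    # Start a child that would otherwise sleep for 30 seconds.
    child = SketchProcess([sys.executable, "-c", "import time; time.sleep(30)"])
    start = time.monotonic()
    child.stop()
    stop_elapsed = time.monotonic() - start
    # A caller that needs the old process gone (for example before reusing
    # the same --dbpath) calls wait() explicitly.
    exit_code = child.wait(timeout=10)
    return stop_elapsed, exit_code


if __name__ == "__main__":
    elapsed, code = demo()
    print(f"stop() returned after {elapsed:.3f}s, exit code {code}")
```

With this shape, a caller such as ContinuousStepdown could call stop() and move on to stepping up a secondary, while a caller that restarts a node on the same data files would call stop() followed by wait().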
| Comment by Zack Winter [ 06/Oct/23 ] |
|
It looks like the Windows-specific logic here handles the difference in behavior between Linux and Windows when calling `process.terminate` in Python: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.terminate

On Linux it sends SIGTERM, which gives the process a chance to handle it: https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGTERM

On Windows it calls the TerminateProcess API, which doesn't give the process any way to handle it: https://learn.microsoft.com/en-us/windows/win32/procthread/terminating-a-process

The Windows-specific logic sets an event that mongod is listening for. Once mongod handles it, it'll call https://github.com/10gen/mongo/blob/SERVER-81852/src/mongo/util/signal_handlers.cpp#L158 just as it does when receiving a SIGTERM on Linux. Afterwards, it'll reset the event and resmoke will continue execution.

I believe the difference is that `process.terminate` on Linux sends SIGTERM and does not wait, while on Windows we wait for the event to be received. Modifying the Windows logic to just set the event without waiting for it to complete may make it match the Linux behavior better.

max.hirschhorn@mongodb.com does this seem reasonable to you? |
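The "set the event without waiting" idea can be illustrated with a small simulation. This is only a sketch: `threading.Event` stands in for the Windows event that mongod listens for, a thread stands in for mongod, and the names `run_server`, `request_shutdown`, and `demo_event` are hypothetical.

```python
import threading
import time


def run_server(shutdown_requested, shutdown_done, cleanup_seconds=0.5):
    # Stand-in for mongod: block until shutdown is requested, then clean up.
    shutdown_requested.wait()
    time.sleep(cleanup_seconds)  # pretend to flush data and close files
    shutdown_done.set()


def request_shutdown(shutdown_requested):
    # Set the event and return; do not wait for the cleanup to finish.
    shutdown_requested.set()


def demo_event():
    requested = threading.Event()
    done = threading.Event()
    server = threading.Thread(target=run_server, args=(requested, done))
    server.start()

    start = time.monotonic()
    request_shutdown(requested)
    request_elapsed = time.monotonic() - start
    done_at_request = done.is_set()  # cleanup is normally still in progress

    # A separate, explicit wait for callers that need the shutdown finished.
    server.join(timeout=5)
    return request_elapsed, done_at_request, done.is_set()


if __name__ == "__main__":
    print(demo_event())
```

The request returns well before the simulated cleanup finishes, which mirrors the SIGTERM behavior described for Linux. Waiting for completion becomes a separate step that only some callers take.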