[SERVER-39883] Powercycle doesn't actually wait for the mongod process to exit during shutdown_mongod Created: 28/Feb/19  Updated: 29/Oct/23  Resolved: 24/Jun/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.4.0-rc11, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed
Votes: 1
Labels: tig-powercycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to WT-6412 Fix extended stalls being seen during... Closed
is related to SERVER-35506 The Powercycle wait_for_mongod_shutdo... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: STM 2020-06-29
Participants:
Linked BF Score: 50
Story Points: 3

 Description   

The "shutdown_mongod" action runs the {shutdown: 1, force: true} command and then (on Linux) waits for psutil to report that no processes named "mongod" exist. The wait_for_mongod_shutdown() function then sleeps an arbitrary extra 5 seconds to wait for any pending I/O to finish. It is possible for 5 seconds to not be long enough, in which case a file can disappear while rsync is running or the mongod process can fail to start.



 Comments   
Comment by Githook User [ 24/Jun/20 ]

Author:

{'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}

Message: SERVER-39883: Have Posix powercycle tests check the MongoDB lock file to signal process termination.

(cherry picked from commit 128ea14211bc3b0925c5788a848b3c696743f540)
Branch: v4.4
https://github.com/mongodb/mongo/commit/77d222a1b6aab0763ef64c9a7712b1aace60bfe8

Comment by Githook User [ 24/Jun/20 ]

Author:

{'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}

Message: SERVER-39883: Have Posix powercycle tests check the MongoDB lock file to signal process termination.
Branch: master
https://github.com/mongodb/mongo/commit/128ea14211bc3b0925c5788a848b3c696743f540

Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ]

Thanks for the heads up bruce.lucas. WT-6416 is included (though, barely) in my patch runs.

Comment by Bruce Lucas (Inactive) [ 18/Jun/20 ]

There was also an issue causing slow shutdown that was fixed by WT-6416 - do the tests that are timing out include that fix?

Comment by Eric Milkie [ 17/Jun/20 ]

The shutdown slowness might be related to WT-6412, so once that ticket is fixed the failure frequency might go down more (to zero, hopefully).

Comment by Daniel Gottlieb (Inactive) [ 17/Jun/20 ]

ian.whalen I've been able to fix exactly what the ticket is describing (I believe), but we're still seeing failures for some of the tasks. WT shutdown does much more work than it used to; the test is causing WT to take in excess of 10-20 minutes to shut down. I don't know how much I can realistically bump up timeouts (even finding the right timeout to bump is a challenge).

Comment by Ian Whalen (Inactive) [ 17/Jun/20 ]

Marking this as 4.4.0 since it is presumably a release blocker that power cycle is failing 100%.

Comment by Jonathan Abrahams [ 05/Mar/19 ]

Per discussion, the logic to check for mongod.lock was removed when refactoring for an OS-independent solution. However, if it makes sense to do it for Linux & Windows differently then we should go with your approach.

Comment by Max Hirschhorn [ 01/Mar/19 ]

As part of the shutdown command (and exitCleanly() in general), the WiredTiger storage engine is shut down before the mongod.lock file is removed. I confirmed with Keith over Slack that when WiredTiger shuts down, it has flushed all of its writes to all open files and closed those files. Those writes should also be visible before returning back to the function call in mongod. This means the unlink() of the mongod.lock file should only be visible after all the writes to the data files.

On a related topic, I was curious to know why you had originally written the condition to check if the lock file is empty or not. Is there a case the lock file would be left behind but as an empty file?

Comment by Jonathan Abrahams [ 28/Feb/19 ]

It seemed a better way to detect that the actual process has been "stopped", and it would be OS-independent. Your proposal seems reasonable; I'd still be concerned about the DB files flushing being complete within 5 seconds.

Comment by Max Hirschhorn [ 28/Feb/19 ]

jonathan.abrahams, what was the reason you removed the os.path.exists(lock_file) and os.stat(lock_file).st_size version of the check in the changes from 08b2554 as part of SERVER-35506? I agree the win32serviceutil.QueryServiceStatus() version is necessary for reliably detecting when mongod has exited on Windows, but the ProcessControl.get_pids() version seems less reliable on Linux. My inclination would be to poll win32serviceutil.QueryServiceStatus() on Windows and poll os.path.exists(lock_file) and os.stat(lock_file).st_size on Linux.
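
For illustration, the Linux half of that proposal could look roughly like the sketch below. It is not the code from 08b2554 or from the eventual fix. The function names, the timeout, and the poll interval are hypothetical, and it assumes the original check meant "mongod is still running while the lock file exists and is non-empty". It uses a single os.stat() call rather than os.path.exists() followed by os.stat(), so the file cannot vanish between the two calls.

```python
import os
import time


def lock_file_released(lock_file):
    """Return True if the lock file is missing or empty.

    Assumption: a missing or zero-length mongod.lock means mongod has exited.
    """
    try:
        return os.stat(lock_file).st_size == 0
    except FileNotFoundError:
        # The file was unlinked, which is the normal clean-shutdown case.
        return True


def wait_for_lock_file_release(lock_file, timeout_secs=300.0, poll_secs=1.0):
    """Poll until the lock file is released or the timeout expires.

    Returns True if the lock file was released, False on timeout.
    The default timeout and poll interval are placeholders, not tuned values.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if lock_file_released(lock_file):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)
```

A caller would pass the path to mongod.lock inside the dbpath and treat a False return as a shutdown failure instead of proceeding to rsync.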

Generated at Thu Feb 08 04:53:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.