[SERVER-39883] Powercycle doesn't actually wait for the mongod process to exit during shutdown_mongod Created: 28/Feb/19 Updated: 29/Oct/23 Resolved: 24/Jun/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc11, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | tig-powercycle |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | STM 2020-06-29 |
| Participants: |
| Linked BF Score: | 50 |
| Story Points: | 3 |
| Description |
|
The "shutdown_mongod" action runs the {shutdown: 1, force: true} command and then (on Linux) waits for psutil to report that no processes named "mongod" exist. The wait_for_mongod_shutdown() function then sleeps an arbitrary extra 5 seconds to wait for any pending I/O to finish. It is possible for 5 seconds to not be long enough, in which case a file may disappear while rsync is running or the mongod process may fail to start. |
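One alternative to a fixed sleep is to poll a liveness check until it reports that the process is gone, bounded by a deadline. The following is only a minimal sketch of that idea, not the powercycle implementation: `wait_for_exit`, `is_running`, `timeout_secs`, and `poll_secs` are hypothetical names chosen for illustration.

```python
import time


def wait_for_exit(is_running, timeout_secs=60.0, poll_secs=0.5):
    """Poll is_running() until it returns False or the deadline passes.

    is_running is a caller-supplied callable (hypothetical; not part of
    powercycle). Returns True if the process was observed to have exited,
    or False if timeout_secs elapsed first.
    """
    deadline = time.monotonic() + timeout_secs
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)
    return True
```

With psutil installed, a caller could pass something like `lambda: psutil.pid_exists(pid)` as `is_running`. Note that this only confirms the process has exited; it says nothing about whether pending I/O has reached disk.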
| Comments |
| Comment by Githook User [ 24/Jun/20 ] |
|
Author: {'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'} Message: (cherry picked from commit 128ea14211bc3b0925c5788a848b3c696743f540) |
| Comment by Githook User [ 24/Jun/20 ] |
|
Author: {'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}Message: |
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
|
Thanks for the heads up bruce.lucas. |
| Comment by Bruce Lucas (Inactive) [ 18/Jun/20 ] |
|
There was also an issue causing slow shutdown that was fixed by |
| Comment by Eric Milkie [ 17/Jun/20 ] |
|
The shutdown slowness might be related to |
| Comment by Daniel Gottlieb (Inactive) [ 17/Jun/20 ] |
|
ian.whalen I've been able to fix exactly what the ticket is describing (I believe), but we're still seeing failures for some of the tasks. WT shutdown does much more work than it used to, and the test is causing WT to take in excess of 10-20 minutes to shut down. I don't know how much I can realistically bump up timeouts (even finding the right timeout to bump is a challenge). |
| Comment by Ian Whalen (Inactive) [ 17/Jun/20 ] |
|
Marking this as 4.4.0 since it is presumably a release blocker that power cycle is failing 100%. |
| Comment by Jonathan Abrahams [ 05/Mar/19 ] |
|
Per discussion, the logic to check for mongod.lock was removed when refactoring for an OS-independent solution. However, if it makes sense to do it for Linux & Windows differently then we should go with your approach. |
| Comment by Max Hirschhorn [ 01/Mar/19 ] |
|
As part of the shutdown command (and exitCleanly() in general), the WiredTiger storage engine is shut down before the mongod.lock file is removed. I confirmed with Keith over Slack that when WiredTiger shuts down, it has flushed all of its writes to all open files and closed those files. Those writes should also be visible before returning back to the function call in mongod. This means the unlink() of the mongod.lock file should only be visible after all the writes to the data files. On a related topic, I was curious to know why you had originally written the condition to check if the lock file is empty or not. Is there a case where the lock file would be left behind but as an empty file? |
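The lock-file reasoning above could be turned into a shutdown check along the following lines. This is a hedged sketch, not the code from the fix: the helper names and parameters are hypothetical, and treating an empty lock file as "released" is an assumption that mirrors the older os.path.exists / st_size check mentioned in this ticket.

```python
import os
import time


def lock_file_released(lock_file):
    """Return True if lock_file is absent or empty.

    Assumption: an empty mongod.lock is treated the same as a removed one.
    """
    try:
        return os.stat(lock_file).st_size == 0
    except FileNotFoundError:
        return True


def wait_for_lock_release(lock_file, timeout_secs=60.0, poll_secs=0.5):
    """Poll until the lock file is released or timeout_secs elapses.

    Returns True once the lock file is absent or empty, False on timeout.
    """
    deadline = time.monotonic() + timeout_secs
    while not lock_file_released(lock_file):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)
    return True
```

Because the lock file is removed only after the storage engine has shut down, observing its removal is a stronger signal than the process name disappearing followed by a fixed sleep.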
| Comment by Jonathan Abrahams [ 28/Feb/19 ] |
|
It seemed a better way to detect that the actual process has been "stopped" and would be OS-independent. Your proposal seems reasonable; I'd still be concerned about the DB files flushing being complete within 5 seconds. |
| Comment by Max Hirschhorn [ 28/Feb/19 ] |
|
jonathan.abrahams, what was the reason you removed the os.path.exists(lock_file) and os.stat(lock_file).st_size version of the check in the changes from 08b2554 as part of |