- Type: Task
- Resolution: Works as Designed
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Labels: None
Service Arch
The "thread liveness monitor" is a new feature I proposed a few weeks ago, and shameek.ray is working on creating a PM ticket for it. Once the feature is implemented, add the monitor to Watchdog (in the Enterprise module) and place the checkpoint immediately before the open() call in checkFile().
Background
See SERVER-55510 and the HELP ticket for more details. The detection of storage failure has gray areas. First, SERVER-55510 will make the open() call nonblocking, because a faulty "slow" storage driver may legally block indefinitely. Slow and fast file devices are explained here:
https://www.linuxtoday.com/blog/blocking-and-non-blocking-i-0.html
The dark corner I reproduced is based on the device mapper. The device mapper is a legitimate production method of creating new derived devices, for example RAID. If the underlying device is a block device, the mapped device is also a block device and is considered "fast". open() on a fast device will block even if the O_NONBLOCK flag is supplied. However, a mapped device can be "suspended". I claim that suspend mode (or any similar unavailability) is not a hack but a valid production corner case that can result from a script error, an unknown cloud management pattern, etc. If we can reproduce it with a regular production kernel and a regular account, it is valid.
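To make the O_NONBLOCK point concrete, here is a minimal POSIX sketch. The helper names openNonBlocking() and hasNonBlockFlag() are hypothetical and are not taken from the Watchdog code. The flag is accepted and recorded on the descriptor, but for regular files and block devices it does not bound how long open() itself may take, which is why a suspended mapped device can still hang the caller.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not part of Watchdog): open `path` read-only with
// O_NONBLOCK. Returns a file descriptor, or -1 with errno set on failure.
// Note: for regular files and block devices the flag does not prevent
// open() itself from blocking.
int openNonBlocking(const char* path) {
    return ::open(path, O_RDONLY | O_NONBLOCK);
}

// Hypothetical helper: report whether O_NONBLOCK is set on the descriptor.
bool hasNonBlockFlag(int fd) {
    int flags = ::fcntl(fd, F_GETFL);
    return flags != -1 && (flags & O_NONBLOCK) != 0;
}
```

The flag being present on the returned descriptor says nothing about whether the open() call could have blocked, so a separate timeout mechanism is still needed.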
Solution
The thread liveness monitor will kill the server if open() blocks for more than 5 minutes.
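Since the feature does not exist yet, the following is only a rough sketch of how such a checkpoint might behave. The class LivenessCheckpoint, its arm()/disarm()/isStalled() methods, and the openWithCheckpoint() helper are all assumptions made for illustration, not the proposed API. The monitored thread arms the checkpoint right before open() and disarms it afterwards; a separate monitor thread (not shown) would poll isStalled() and terminate the process when it returns true.

```cpp
#include <atomic>
#include <chrono>
#include <limits>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical sketch of a liveness checkpoint; names are illustrative only.
class LivenessCheckpoint {
public:
    using Clock = std::chrono::steady_clock;

    explicit LivenessCheckpoint(Clock::duration deadline) : _deadline(deadline) {}

    // Called by the monitored thread right before a call that may block.
    void arm(Clock::time_point now = Clock::now()) {
        _armedAt.store(now.time_since_epoch().count(), std::memory_order_release);
    }

    // Called by the monitored thread once the call has returned.
    void disarm() {
        _armedAt.store(kDisarmed, std::memory_order_release);
    }

    // Called periodically by the monitor thread. True if the checkpoint has
    // been armed for longer than the deadline.
    bool isStalled(Clock::time_point now = Clock::now()) const {
        const Clock::rep armedAt = _armedAt.load(std::memory_order_acquire);
        if (armedAt == kDisarmed) {
            return false;
        }
        return now - Clock::time_point(Clock::duration(armedAt)) > _deadline;
    }

private:
    static constexpr Clock::rep kDisarmed = std::numeric_limits<Clock::rep>::min();
    Clock::duration _deadline;
    std::atomic<Clock::rep> _armedAt{kDisarmed};
};

// Hypothetical stand-in for the open() step in checkFile(): the checkpoint
// brackets the call, so a hang inside open() leaves it armed.
int openWithCheckpoint(LivenessCheckpoint& checkpoint, const char* path) {
    checkpoint.arm();
    int fd = ::open(path, O_RDONLY | O_NONBLOCK);
    checkpoint.disarm();
    return fd;
}
```

With a deadline of std::chrono::minutes(5), a thread stuck inside open() on a suspended device would be reported as stalled once five minutes have elapsed, matching the timeout described above.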