[SERVER-55515] Add Watchdog loop to thread liveness monitor Created: 25/Mar/21  Updated: 27/Oct/23  Resolved: 29/Mar/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive)
Assignee: Backlog - Service Architecture
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Service Arch
Participants:

 Description   

"Thread liveness monitor" is a new feature I proposed a few weeks ago, and shameek.ray is working on creating a PM ticket for it. Once the feature is implemented, add the monitor to Watchdog (in the Enterprise module) and place a checkpoint right before the open() call in checkFile().

Background

See SERVER-55510 and the HELP ticket for more details. The detection of storage failure has gray areas. First, SERVER-55510 will convert open() to nonblocking, because it is legal for a faulty "slow" storage driver to block indefinitely. Slow and fast file devices are explained here:
https://www.linuxtoday.com/blog/blocking-and-non-blocking-i-0.html
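
To illustrate what a nonblocking open() buys on a "slow" file, here is a minimal sketch, assuming a POSIX system. It uses a FIFO as a stand-in for a slow device (not the storage driver discussed in this ticket), and the helper names are hypothetical. A blocking write-only open() of a FIFO waits for a reader; with O_NONBLOCK it fails immediately with ENXIO instead. On a regular file the flag makes no visible difference.

```python
import errno
import os
import tempfile


def try_open_nonblocking(path):
    """Open `path` write-only with O_NONBLOCK.

    Returns (fd, None) on success or (None, errno_code) on failure.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        return fd, None
    except OSError as exc:
        return None, exc.errno


def probe_fifo_without_reader():
    """Return the errno from a nonblocking open of a FIFO that has no reader."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "probe.fifo")
        os.mkfifo(fifo_path)
        fd, err = try_open_nonblocking(fifo_path)
        if fd is not None:
            os.close(fd)
        return err


def probe_regular_file():
    """Return the errno (None on success) from a nonblocking open of a regular file."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd, err = try_open_nonblocking(tmp.name)
        if fd is not None:
            os.close(fd)
        return err
```

On a POSIX system, probe_fifo_without_reader() is expected to return errno.ENXIO and probe_regular_file() to return None.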

The dark corner I reproduced is based on device mapper. Device mapper is a legitimate production method of creating new derived devices, for example RAID. If the underlying device is a block device, the mapped device is also a block device and is considered "fast". open() on a fast device can still block even if the O_NONBLOCK flag is supplied. Moreover, a mapped device can be "suspended". I claim that suspend mode (or any similar unavailability) is not a hack but a valid production corner case that can result from a script error, an unknown Cloud management pattern, etc. If we can reproduce it with a regular production kernel and a regular account, it is valid.

Solution

The thread liveness monitor will kill the server if open() blocks for more than 5 minutes.
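
The proposal can be sketched roughly as follows. This is an illustration only, not the server's implementation: the class, method, and checkpoint names are hypothetical, and the callback stands in for terminating the process. A thread records a checkpoint before a call that may hang and clears it afterwards; a separate monitor periodically looks for checkpoints older than the timeout.

```python
import threading
import time


class ThreadLivenessMonitor:
    """Hypothetical monitor that flags checkpoints held longer than a timeout."""

    def __init__(self, timeout_seconds, on_stuck):
        self._timeout = timeout_seconds
        self._on_stuck = on_stuck  # a real server would terminate the process here
        self._lock = threading.Lock()
        self._checkpoints = {}  # checkpoint name -> monotonic time it was entered

    def enter(self, name):
        """Record that a thread is about to make a call that may block."""
        with self._lock:
            self._checkpoints[name] = time.monotonic()

    def leave(self, name):
        """Clear the checkpoint once the call has returned."""
        with self._lock:
            self._checkpoints.pop(name, None)

    def check_once(self, now=None):
        """Invoke on_stuck for every checkpoint older than the timeout."""
        now = time.monotonic() if now is None else now
        with self._lock:
            stuck = [n for n, t in self._checkpoints.items()
                     if now - t > self._timeout]
        for name in stuck:
            self._on_stuck(name)
        return stuck


def demo():
    """Simulate an open() that has been blocked for just over 5 minutes."""
    reported = []
    monitor = ThreadLivenessMonitor(timeout_seconds=300, on_stuck=reported.append)
    monitor.enter("checkFile:open")  # checkpoint placed right before open()
    monitor.check_once(now=time.monotonic() + 301)  # pretend 301 s have passed
    monitor.leave("checkFile:open")
    monitor.check_once(now=time.monotonic() + 301)  # nothing left to report
    return reported
```

Passing the current time into check_once() keeps the sketch testable without waiting; a real monitor would call it from its own thread on a timer.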



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 29/Mar/21 ]

Reproduced the scenario: Watchdog still handles the situation properly, because WatchdogMonitorThread was able to detect the stuck open().
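
One common way for a monitor thread to notice that a checker thread is stuck is a progress (generation) counter. The sketch below illustrates that general technique. Whether WatchdogMonitorThread works exactly this way is an assumption, and all names here are hypothetical. The checker bumps the counter after each completed check; the monitor compares the counter between polls and treats an unchanged value as a stall.

```python
import threading


class GenerationCounter:
    """Thread-safe counter that a checker bumps after each completed check."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def bump(self):
        with self._lock:
            self._value += 1

    def read(self):
        with self._lock:
            return self._value


class StallDetector:
    """Reports a stall when the counter has not advanced since the last poll."""

    def __init__(self, counter):
        self._counter = counter
        self._last_seen = counter.read()

    def poll(self):
        """Return True if no progress was made since the previous poll."""
        current = self._counter.read()
        stalled = current == self._last_seen
        self._last_seen = current
        return stalled
```

With this design the monitor does not need to know which call is blocked; a checker stuck inside open() simply stops bumping the counter, and the next poll reports the stall.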

Generated at Thu Feb 08 05:36:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.