Core Server / SERVER-55515

Add Watchdog loop to thread liveness monitor

    • Type: Task
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Service Arch

      "Thread liveness monitor" is a new feature I proposed a few weeks ago, and shameek.ray is working on creating a PM ticket for it. Once the feature is implemented, add the monitor to Watchdog (in the Enterprise module) and put the checkpoint right before the open() call in checkFile().

      Background

      See SERVER-55510 and the HELP ticket for more details. Detecting storage failure has gray areas. First, SERVER-55510 will make open() non-blocking, because a faulty "slow" storage driver may legally block indefinitely. Slow and fast file devices are explained here:
      https://www.linuxtoday.com/blog/blocking-and-non-blocking-i-0.html

      The dark corner I reproduced is based on the device mapper. The device mapper is a legitimate production mechanism for creating new derived devices, for example RAID. If the underlying device is a block device, the mapped device is also a block device and is considered "fast". open() on a fast device will block even if the O_NONBLOCK flag is supplied. However, a mapped block device can be "suspended". I claim that suspend mode (or any similar unavailability) is not a hack but a valid production corner case that can result from a script error, an unknown cloud management pattern, etc. If we can reproduce it with a regular production kernel and a regular account, it is valid.
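
The slow/fast distinction can be illustrated with a small Python sketch. It assumes a POSIX system; the helper names `timed_nonblocking_open` and `demo` are hypothetical. A FIFO stands in for a "slow" file, where O_NONBLOCK makes open() return immediately, and a regular file stands in for a "fast" one, where the flag has no effect on open(). It does not reproduce the suspended device-mapper case, which would need a real mapped device.

```python
import os
import tempfile
import time


def timed_nonblocking_open(path: str) -> float:
    """Open `path` read-only with O_NONBLOCK; return the seconds open() took."""
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    elapsed = time.monotonic() - start
    os.close(fd)
    return elapsed


def demo() -> dict:
    """Time a non-blocking open() on a FIFO and on a regular file."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "fifo")
        file_path = os.path.join(tmp, "regular")
        # "Slow" file: without O_NONBLOCK, a read-only open() would wait
        # until some process opens the FIFO for writing.
        os.mkfifo(fifo_path)
        # "Fast" file: O_NONBLOCK is accepted but does not change open().
        with open(file_path, "w") as f:
            f.write("x")
        return {
            "fifo": timed_nonblocking_open(fifo_path),
            "regular": timed_nonblocking_open(file_path),
        }


if __name__ == "__main__":
    print(demo())
```

Because the flag is ignored for fast devices, a stalled open() on a suspended mapped device cannot be avoided from inside the calling thread, which is why an external monitor is proposed.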

      Solution

      The thread liveness monitor will kill the server if open() blocks for more than 5 minutes.
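
A minimal sketch of the idea in Python, not the server implementation: `LivenessMonitor`, `checkpoint`, `done`, `check_file`, and the `on_stall` callback are all hypothetical names. A background thread watches checkpoints and reports any that stay open longer than the timeout. In the real server the callback would terminate the process and the timeout would be 5 minutes.

```python
import os
import threading
import time
from typing import Callable, Dict, Tuple


class LivenessMonitor:
    """Calls `on_stall(label)` when a checkpoint stays open longer than `timeout_s`."""

    def __init__(self, timeout_s: float, on_stall: Callable[[str], None],
                 poll_s: float = 0.01) -> None:
        self._timeout_s = timeout_s
        self._on_stall = on_stall
        self._poll_s = poll_s
        self._lock = threading.Lock()
        # thread id -> (label, time the checkpoint was set)
        self._active: Dict[int, Tuple[str, float]] = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def checkpoint(self, label: str) -> None:
        """Mark that the calling thread is about to enter a risky call."""
        with self._lock:
            self._active[threading.get_ident()] = (label, time.monotonic())

    def done(self) -> None:
        """Mark that the calling thread returned from the risky call."""
        with self._lock:
            self._active.pop(threading.get_ident(), None)

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.wait(self._poll_s):
            now = time.monotonic()
            with self._lock:
                stalled = [(tid, label)
                           for tid, (label, since) in self._active.items()
                           if now - since > self._timeout_s]
                for tid, _ in stalled:
                    del self._active[tid]  # report each stall only once
            for _, label in stalled:
                self._on_stall(label)


def check_file(monitor: LivenessMonitor, path: str, open_fn=os.open) -> None:
    """Hypothetical stand-in for checkFile(): checkpoint right before open()."""
    monitor.checkpoint(f"open({path})")
    try:
        fd = open_fn(path, os.O_RDONLY)
    finally:
        monitor.done()
    os.close(fd)
```

The monitor runs in its own thread because the thread stuck in open() cannot time itself out. The callback is injected so that the stall action (killing the server, in this ticket's proposal) can be replaced in tests.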

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)
            Votes:
            0
            Watchers:
            5
