  Core Server / SERVER-29947

Implement Storage Node Watchdog

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 3.2.16, 3.4.7, 3.5.10
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v3.4, v3.2
    • Sprint: Platforms 2017-07-10, Platforms 2017-07-31

      Issue Status as of Jul 13, 2017

      FEATURE DESCRIPTION
      The Storage Node Watchdog is a new feature in MongoDB designed to detect unresponsive I/O conditions.

      VERSIONS
      This is an Enterprise-only feature in MongoDB, available in the 3.2.16, 3.4.7, and newer production releases. The Watchdog is not available on macOS.

      OPERATION
      The Storage Node Watchdog is disabled by default:

      • It must be enabled at startup as follows:
        mongod --setParameter watchdogPeriodSeconds=60
        

        The watchdogPeriodSeconds parameter is an integer number of seconds. It can be either -1 (the default), which disables the watchdog, or a value greater than or equal to 60.

      • The watchdog may be paused at runtime by setting watchdogPeriodSeconds to -1 via the setParameter command:
        MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : -1})
        
      • The watchdog may be resumed at runtime or its period changed by setting watchdogPeriodSeconds to a value >= 60:
        MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : 120})
        

      It is an error to set watchdogPeriodSeconds at runtime unless mongod was started with a value >= 60.
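
The value rules above can be sketched as a small pre-flight check. This is only an illustration in plain JavaScript: isValidWatchdogPeriod and buildWatchdogCommand are hypothetical helper names, not part of MongoDB, and the check covers only the value range, not whether the server was started with the watchdog enabled.

```javascript
// Hypothetical helper (not part of MongoDB): checks whether a value falls in
// the range described above for watchdogPeriodSeconds.
function isValidWatchdogPeriod(seconds) {
  if (!Number.isInteger(seconds)) {
    return false; // the parameter is an integer number of seconds
  }
  return seconds === -1 || seconds >= 60; // -1 disables; otherwise >= 60
}

// Hypothetical helper: builds the setParameter command document shown above,
// rejecting values outside the described range before they reach the server.
function buildWatchdogCommand(seconds) {
  if (!isValidWatchdogPeriod(seconds)) {
    throw new Error("watchdogPeriodSeconds must be -1 or >= 60, got " + seconds);
  }
  return { setParameter: 1, watchdogPeriodSeconds: seconds };
}
```

The resulting document has the same shape as the argument passed to db.runCommand in the examples above.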

      The watchdog monitors the following directories:

      • The --dbpath directory
      • The --dbpath/journal directory if the journal is enabled
      • The directory containing the --logpath file
      • The directory containing the --auditPath file

      If any of these directories resides on an I/O subsystem that becomes unresponsive, the watchdog detects the condition after sufficient time has passed and terminates mongod, tearing down all its threads and exiting the process with exit code 61. The maximum time the watchdog can take to detect an unresponsive I/O subsystem is approximately twice the value of watchdogPeriodSeconds.
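
To make the timing and exit-code statements concrete, here is a minimal sketch in plain JavaScript. The function names are hypothetical, the factor of two is the approximate bound stated above rather than an exact guarantee, and the exit-code check assumes that 61 is produced only by the watchdog.

```javascript
// Exit code mongod uses when the watchdog terminates it, as stated above.
const WATCHDOG_EXIT_CODE = 61;

// Hypothetical helper: approximate worst-case time, in seconds, before an
// unresponsive I/O subsystem is detected (about twice the period).
function maxDetectionSeconds(watchdogPeriodSeconds) {
  if (!Number.isInteger(watchdogPeriodSeconds) || watchdogPeriodSeconds < 60) {
    throw new Error("watchdog is only active for periods >= 60");
  }
  return 2 * watchdogPeriodSeconds;
}

// Hypothetical helper for a supervisor script: reports whether a mongod exit
// code matches the watchdog's exit code (assumes 61 is watchdog-only).
function exitedDueToWatchdog(exitCode) {
  return exitCode === WATCHDOG_EXIT_CODE;
}
```

With watchdogPeriodSeconds=60, for example, this gives a worst-case detection time of roughly 120 seconds.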

      IMPLEMENTATION DETAILS
      The watchdog is implemented as a pair of threads in mongod that monitor the directories MongoDB uses to store data and log files. One thread checks the monitored directories, and a second thread ensures that the first thread never gets stuck. The check thread runs at a fixed 10-second interval.
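
The two-thread arrangement can be modeled with a pair of generation counters. The sketch below is a simplified, single-threaded model in plain JavaScript, not MongoDB's actual code: createWatchdogModel, recordCheck, and monitorOnce are hypothetical names, and the assumption is that the second thread detects a stuck check thread by noticing that the check counter has stopped advancing.

```javascript
// Simplified model (an assumption, not MongoDB's implementation) of a check
// "thread" and a monitor "thread" that share generation counters.
function createWatchdogModel() {
  const state = { checkGeneration: 0, monitorGeneration: 0, lastSeenCheck: 0 };
  return {
    // Check thread: call once for each directory check that completes.
    recordCheck() {
      state.checkGeneration += 1;
    },
    // Monitor thread: returns true if the check thread made progress since
    // the previous call. A real watchdog would terminate the process when
    // this is false; the model only reports it.
    monitorOnce() {
      state.monitorGeneration += 1;
      const progressed = state.checkGeneration > state.lastSeenCheck;
      state.lastSeenCheck = state.checkGeneration;
      return progressed;
    },
    // Returns a copy of the two counters.
    snapshot() {
      return {
        checkGeneration: state.checkGeneration,
        monitorGeneration: state.monitorGeneration,
      };
    },
  };
}
```

In this model, a directory check that blocks on unresponsive I/O never calls recordCheck, so the next monitorOnce call returns false.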

      DIAGNOSTICS
      When enabled, the watchdog logs all changes to watchdogPeriodSeconds at the default log level.

      When enabled at startup, the following message will appear in the logs:

      CONTROL  [initandlisten] Starting Watchdog Monitor
      

      If the watchdog is disabled or its period is changed at runtime, messages like the following will appear in the logs:

      CONTROL  [initandlisten] WatchdogMonitor disabled
      CONTROL  [initandlisten] WatchdogMonitor period changed to 120s
      

      At log level 1, the watchdog logs its periodic disk checks:

      CONTROL  [watchdogCheck] Watchdog test 'checked directory '/data/db/'' took 3ms
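
For log analysis, the check duration can be pulled out of messages of this shape. The following is a minimal sketch in plain JavaScript; parseWatchdogCheck is a hypothetical helper, and the pattern assumes the message text is exactly as in the sample line above.

```javascript
// Hypothetical helper: extracts the directory and duration from a watchdog
// check message formatted like the sample above. Returns null for lines
// that do not match.
function parseWatchdogCheck(logLine) {
  const pattern = /Watchdog test 'checked directory '(.*)'' took (\d+)ms/;
  const match = pattern.exec(logLine);
  if (match === null) {
    return null;
  }
  return { directory: match[1], millis: parseInt(match[2], 10) };
}
```

Applied to the sample line, this yields the directory /data/db/ and a duration of 3 milliseconds.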
      

      If the watchdog was enabled at startup, an additional section named "watchdog" is added to the output of the serverStatus command:

      MongoDB Enterprise > db.serverStatus()
              ...
              "watchdog" : {
                      "checkGeneration" : NumberLong(2),
                      "monitorGeneration" : NumberLong(0),
                      "monitorPeriod" : 120
              },
              ...
      

      The meaning of this data is:

      • checkGeneration: 64-bit signed integer; indicates the number of directory checks run since startup. It increments once for each directory checked. For example, if dbpath and logpath are specified, then this field is incremented twice every 10 seconds.
      • monitorGeneration: 64-bit signed integer; number of times the check thread has been checked for progress.
      • monitorPeriod: 64-bit signed integer; the value of the watchdogPeriodSeconds parameter.
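
One way to use these counters is to compare two samples of the "watchdog" section taken some time apart. The sketch below is plain JavaScript with hypothetical helper names. It assumes the counter values have already been converted to ordinary numbers (the shell displays them as NumberLong), and the expected-count estimate relies only on the 10-second check interval described above, so it is approximate.

```javascript
// Hypothetical helper: difference between two samples of the "watchdog"
// section of serverStatus, with counters given as plain numbers.
function watchdogProgress(previous, current) {
  return {
    checksRun: current.checkGeneration - previous.checkGeneration,
    monitorPasses: current.monitorGeneration - previous.monitorGeneration,
  };
}

// Hypothetical helper: rough number of directory checks expected over an
// interval, at one check per monitored directory every 10 seconds.
function expectedChecks(directoryCount, elapsedSeconds) {
  return directoryCount * Math.floor(elapsedSeconds / 10);
}
```

For example, with dbpath and logpath monitored (two directories), about 12 checks would be expected over 60 seconds; a checksRun value far below that may merit a closer look at the I/O subsystem.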

      TEST METHODOLOGY
      We use CharybdeFS, a Linux FUSE file system from ScyllaDB, to create unresponsive I/O conditions and verify that the storage watchdog detects them.

      Original description

      Implement a storage node watchdog for Linux.

            Assignee:
            mark.benvenuto@mongodb.com Mark Benvenuto
            Reporter:
            mark.benvenuto@mongodb.com Mark Benvenuto
            Votes:
            0
            Watchers:
            16
