Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 3.2.16, 3.4.7, 3.5.10
Affects Version/s: None
Component/s: None
Labels: None

Backwards Compatibility: Fully Compatible
Backport Requested: v3.4, v3.2
Sprint: Platforms 2017-07-10, Platforms 2017-07-31
Case:
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Issue Status as of Jul 13, 2017

FEATURE DESCRIPTION
The Storage Node Watchdog is a new feature in MongoDB designed to detect unresponsive I/O conditions.

VERSIONS
This is an Enterprise-only feature in MongoDB, available in production releases 3.2.16, 3.4.7, and newer. The Watchdog is not available on macOS.

OPERATION
The Storage Node Watchdog is disabled by default.

It must be enabled at startup as follows:
```
mongod --setParameter watchdogPeriodSeconds=60
```
The watchdogPeriodSeconds parameter is an integer number of seconds. It can be either -1 (the default value), which means the watchdog is disabled, or a value greater than or equal to 60.
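
The accepted values can be expressed as a small check. This is a minimal sketch in plain JavaScript; `isValidWatchdogPeriod` is a hypothetical helper written for illustration, not part of MongoDB or the shell, and it only restates the rule above.

```javascript
// Hypothetical helper: restates the documented rule, not MongoDB's own validation code.
function isValidWatchdogPeriod(seconds) {
  // The parameter is an integer number of seconds.
  if (!Number.isInteger(seconds)) {
    return false;
  }
  // -1 means disabled; any other value must be at least 60.
  return seconds === -1 || seconds >= 60;
}

console.log(isValidWatchdogPeriod(-1));
console.log(isValidWatchdogPeriod(59));
```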

The watchdog may be paused at runtime by setting watchdogPeriodSeconds to -1 via the setParameter command:
```
MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : -1})
```
The watchdog may be resumed at runtime, or its period changed, by setting watchdogPeriodSeconds to a value >= 60:
```
MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : 120})
```

It is an error to set watchdogPeriodSeconds at runtime if the server was not started with a value >= 60.
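
The startup and runtime rules combine as follows. This is a rough model only: `checkRuntimeChange` and its return shape are hypothetical, and the `reason` strings are illustrative rather than real server error messages.

```javascript
// Hypothetical model of the runtime rule; not the server's actual logic or error text.
function checkRuntimeChange(startupPeriod, requestedPeriod) {
  // Runtime changes are only allowed if the watchdog was enabled at startup.
  const enabledAtStartup = Number.isInteger(startupPeriod) && startupPeriod >= 60;
  if (!enabledAtStartup) {
    return { ok: false, reason: 'watchdog was not enabled at startup' };
  }
  // The requested value must be -1 (pause) or an integer >= 60.
  const validRequest = Number.isInteger(requestedPeriod) &&
    (requestedPeriod === -1 || requestedPeriod >= 60);
  if (!validRequest) {
    return { ok: false, reason: 'period must be -1 or >= 60' };
  }
  return { ok: true, action: requestedPeriod === -1 ? 'pause' : 'set period' };
}

console.log(checkRuntimeChange(60, 120));
console.log(checkRuntimeChange(-1, 120));
```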

The watchdog monitors the following directories:

The --dbpath directory
The --dbpath/journal directory if the journal is enabled
The directory of --logpath file
The directory of --auditPath file
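
The list above can be derived from the startup options. The following sketch assumes POSIX-style paths and uses hypothetical option names (`dbpath`, `journalEnabled`, `logpath`, `auditPath`) chosen for this example; it is not how mongod itself computes the list.

```javascript
// POSIX path handling is assumed for this sketch.
const path = require('path').posix;

// Hypothetical helper: the shape of `opts` is an assumption made for illustration.
function monitoredDirectories(opts) {
  const dirs = [];
  if (opts.dbpath) {
    dirs.push(opts.dbpath);
    // The journal directory is only monitored when the journal is enabled.
    if (opts.journalEnabled) {
      dirs.push(path.join(opts.dbpath, 'journal'));
    }
  }
  // For log and audit files, the containing directory is monitored.
  if (opts.logpath) {
    dirs.push(path.dirname(opts.logpath));
  }
  if (opts.auditPath) {
    dirs.push(path.dirname(opts.auditPath));
  }
  return dirs;
}

console.log(monitoredDirectories({
  dbpath: '/data/db',
  journalEnabled: true,
  logpath: '/var/log/mongodb/mongod.log',
}));
```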

If any of these directories resides on an I/O subsystem that becomes unresponsive, the watchdog detects the condition after sufficient time has passed, then terminates mongod, tearing down all its threads and exiting the process with exit code 61. The maximum time the watchdog can take to detect an unresponsive I/O subsystem is approximately twice the value of watchdogPeriodSeconds.
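
As a quick arithmetic aid, the upper bound on detection time can be computed from the configured period. `maxDetectionSeconds` is a hypothetical helper, and the result is approximate, as stated above.

```javascript
// Exit code used when the watchdog terminates mongod, as stated in the text.
const WATCHDOG_EXIT_CODE = 61;

// Hypothetical helper: approximate worst-case detection time in seconds.
function maxDetectionSeconds(watchdogPeriodSeconds) {
  if (!Number.isInteger(watchdogPeriodSeconds) || watchdogPeriodSeconds < 60) {
    throw new RangeError('watchdog is enabled only for integer periods >= 60');
  }
  // Detection takes at most about twice the configured period.
  return 2 * watchdogPeriodSeconds;
}

console.log(maxDetectionSeconds(60));
console.log(WATCHDOG_EXIT_CODE);
```

With the minimum period of 60 seconds, an unresponsive I/O subsystem is therefore detected within roughly two minutes.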

IMPLEMENTATION DETAILS
The watchdog is implemented as a pair of threads in mongod that monitor the directories MongoDB uses to store data and log files. One thread checks the monitored directories, and a second thread monitors the first to detect whether it has become stuck. The check thread runs at a fixed 10-second interval.
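
The ticket does not say how the second thread detects a stuck check thread. One common approach, assumed here purely for illustration, is a progress counter that the checker increments and the monitor compares between polls. The classes below are hypothetical and use plain objects in place of real threads.

```javascript
// Illustrative model only: plain objects stand in for the two threads.
class CheckWorker {
  constructor() {
    this.generation = 0;
  }
  // Called after each completed directory check.
  recordCheck() {
    this.generation += 1;
  }
}

class ProgressMonitor {
  constructor(worker) {
    this.worker = worker;
    this.lastSeen = worker.generation;
  }
  // Returns true if the worker completed at least one check since the previous poll.
  poll() {
    const progressed = this.worker.generation > this.lastSeen;
    this.lastSeen = this.worker.generation;
    return progressed;
  }
}

const worker = new CheckWorker();
const monitor = new ProgressMonitor(worker);
worker.recordCheck();
console.log(monitor.poll()); // progress was made
console.log(monitor.poll()); // no new checks since the last poll
```

In this model, a poll that reports no progress is the signal that the check thread may be blocked on I/O.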

DIAGNOSTICS
When enabled, the watchdog logs all changes to watchdogPeriodSeconds at the default log level.

When enabled at startup, the following message will appear in the logs:

CONTROL  [initandlisten] Starting Watchdog Monitor

If the watchdog is disabled or watchdogPeriodSeconds is changed at runtime, messages like the following will appear in the logs:

CONTROL  [initandlisten] WatchdogMonitor disabled
CONTROL  [initandlisten] WatchdogMonitor period changed to 120s

At log level 1, the watchdog logs its periodic disk checks:

CONTROL  [watchdogCheck] Watchdog test 'checked directory '/data/db/'' took 3ms
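
For log analysis, the directory and duration can be pulled out of such a line with a regular expression. This sketch assumes the message format matches the example above exactly; `parseWatchdogCheck` is a hypothetical helper.

```javascript
// Hypothetical parser for the level-1 check message; assumes the format shown above.
function parseWatchdogCheck(line) {
  const match = /Watchdog test 'checked directory '(.+)'' took (\d+)ms/.exec(line);
  if (match === null) {
    return null; // not a watchdog check line
  }
  return { directory: match[1], durationMs: Number(match[2]) };
}

const sample = "CONTROL  [watchdogCheck] Watchdog test 'checked directory '/data/db/'' took 3ms";
console.log(parseWatchdogCheck(sample));
```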

If the watchdog was enabled at startup, an additional section named "watchdog" is added to the output of the serverStatus command.

MongoDB Enterprise > db.serverStatus()
        ...
        "watchdog" : {
                "checkGeneration" : NumberLong(2),
                "monitorGeneration" : NumberLong(0),
                "monitorPeriod" : 120
        },
        ...

The meaning of this data is:

checkGeneration: 64-bit signed integer; indicates the number of directory checks run since startup. It increments once for each directory checked. For example, if dbpath and logpath are specified, then this field is incremented twice every 10 seconds.
monitorGeneration: 64-bit signed integer; number of times the check thread has been checked for progress.
monitorPeriod: 64-bit signed integer; the value of the watchdogPeriodSeconds parameter.
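
The checkGeneration arithmetic can be sketched as follows, using the fixed 10-second check interval. `expectedCheckIncrements` is a hypothetical helper that works on plain numbers; real counts may differ slightly depending on timing.

```javascript
// Fixed check interval stated in the implementation details.
const CHECK_INTERVAL_SECONDS = 10;

// Hypothetical helper: approximate growth of checkGeneration over a time window.
function expectedCheckIncrements(directoryCount, elapsedSeconds) {
  // Each completed round checks every monitored directory once.
  const completedRounds = Math.floor(elapsedSeconds / CHECK_INTERVAL_SECONDS);
  return directoryCount * completedRounds;
}

// dbpath and logpath monitored, observed for one minute.
console.log(expectedCheckIncrements(2, 60));
```

If checkGeneration stops increasing between two serverStatus samples taken more than 10 seconds apart, the check thread is not completing its directory checks.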

TEST METHODOLOGY
We use CharybdeFS, a Linux FUSE file system from ScyllaDB, to create unresponsive I/O conditions and verify that the storage watchdog detects them.

Original description

Implement a storage node watchdog for Linux.

is duplicated by

SERVER-14139 Disk failure on one node can (eventually) block a whole cluster

Closed

is related to

SERVER-30774 Add Storage Node Watchdog to MongoS

Closed

related to

SERVER-31457 Mongod stop responding, takes 200 load and don't even switch to secondary

Closed

Assignee: Mark Benvenuto
Reporter: Mark Benvenuto
Participants: Githook User, Mark Benvenuto
Votes: 0
Watchers: 16

Created: Jun 30 2017 09:31:39 PM UTC
Updated: Nov 08 2024 11:32:28 PM UTC
Resolved: Jul 12 2017 02:15:44 PM UTC
