[SERVER-29947] Implement Storage Node Watchdog
Created: 30/Jun/17  Updated: 30/Oct/23  Resolved: 12/Jul/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.2.16, 3.4.7, 3.5.10

Type: Task
Priority: Major - P3
Reporter: Mark Benvenuto
Assignee: Mark Benvenuto
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-10520 Docs for SERVER-29947: Implement Stor... Closed
Duplicate
is duplicated by SERVER-14139 Disk failure on one node can (eventua... Closed
Related
related to SERVER-31457 Mongod stop responding, takes 200 loa... Closed
is related to SERVER-30774 Add Storage Node Watchdog to MongoS Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4, v3.2
Sprint: Platforms 2017-07-10, Platforms 2017-07-31
Participants:
Case:

 Description   
Issue Status as of Jul 13, 2017

FEATURE DESCRIPTION
The Storage Node Watchdog is a new feature in MongoDB designed to detect unresponsive I/O conditions.

VERSIONS
This is an Enterprise-only feature, available in production releases 3.2.16, 3.4.7, and newer. The watchdog is not available on macOS.

OPERATION
The Storage Node Watchdog is disabled by default:

  • It must be enabled at startup as follows:

    mongod --setParameter watchdogPeriodSeconds=60
    

    The watchdogPeriodSeconds parameter is an integer number of seconds. It can be either -1 (the default), which disables the watchdog, or a value greater than or equal to 60.

  • The watchdog may be paused at runtime by setting watchdogPeriodSeconds to -1 via the setParameter command:

    MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : -1})
    

  • The watchdog may be resumed at runtime or its period changed by setting watchdogPeriodSeconds to a value >= 60:

    MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : 120})
    

It is an error to set watchdogPeriodSeconds at runtime if the server was not started with a value >= 60.
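
The rules above for watchdogPeriodSeconds can be summarized in a small sketch. This is illustrative JavaScript only, not the server's validation code; the function names, constant names, and returned messages are hypothetical.

```javascript
// Hypothetical helpers that mirror the rules stated in this ticket.
const WATCHDOG_DISABLED = -1;
const WATCHDOG_MIN_PERIOD_SECONDS = 60;

// Returns true if `value` is an acceptable watchdogPeriodSeconds setting:
// an integer that is either -1 (disabled) or at least 60.
function isValidWatchdogPeriod(value) {
  if (!Number.isInteger(value)) return false;
  return value === WATCHDOG_DISABLED || value >= WATCHDOG_MIN_PERIOD_SECONDS;
}

// Returns null if a runtime change is allowed, otherwise a reason string.
// `startupValue` is the value the server was started with.
function checkRuntimeChange(startupValue, newValue) {
  if (!isValidWatchdogPeriod(newValue)) {
    return "watchdogPeriodSeconds must be -1 or >= 60";
  }
  if (startupValue < WATCHDOG_MIN_PERIOD_SECONDS) {
    return "watchdog was not enabled at startup";
  }
  return null;
}
```

For example, checkRuntimeChange(60, 120) returns null, while checkRuntimeChange(-1, 120) returns a reason because the watchdog was never enabled at startup.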

The watchdog monitors the following directories:

  • The --dbpath directory
  • The journal directory under --dbpath, if journaling is enabled
  • The directory containing the --logpath file
  • The directory containing the --auditPath file
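
As a rough illustration of the list above, the monitored set can be derived from the startup options. This is a sketch under assumptions: the option object and its field names (dbpath, journalEnabled, logpath, auditPath) are hypothetical, paths are assumed to be POSIX-style, and whether the server de-duplicates overlapping directories is not covered here.

```javascript
// Hypothetical helper: directory that contains `filePath` ("/"-separated).
function parentDirectory(filePath) {
  const idx = filePath.lastIndexOf("/");
  if (idx < 0) return ".";   // bare file name: current directory
  if (idx === 0) return "/"; // file directly under the root
  return filePath.slice(0, idx);
}

// Builds the list of directories the watchdog would monitor, following
// the four bullets above. Field names on `opts` are illustrative only.
function monitoredDirectories(opts) {
  const dirs = [opts.dbpath];
  if (opts.journalEnabled) {
    // Strip trailing slashes so "/data/db/" and "/data/db" behave the same.
    dirs.push(opts.dbpath.replace(/\/+$/, "") + "/journal");
  }
  if (opts.logpath) dirs.push(parentDirectory(opts.logpath));
  if (opts.auditPath) dirs.push(parentDirectory(opts.auditPath));
  return dirs;
}
```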

If any of these directories resides on an I/O subsystem that becomes unresponsive, the watchdog detects the condition after sufficient time has passed and terminates mongod, tearing down all its threads and exiting the process with exit code 61. The maximum time the watchdog can take to detect an unresponsive I/O subsystem is approximately twice the value of watchdogPeriodSeconds.
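
For capacity planning or process supervision, the two numbers in the paragraph above can be captured as follows. This is a minimal sketch; the function and constant names are hypothetical, and the bound is the approximate figure from this ticket rather than a guarantee.

```javascript
// Exit code mongod uses when the watchdog terminates it (from this ticket).
const WATCHDOG_EXIT_CODE = 61;

// Approximate upper bound, in seconds, on how long an unresponsive
// I/O subsystem can go undetected for a given watchdogPeriodSeconds.
function maxDetectionSeconds(watchdogPeriodSeconds) {
  if (!Number.isInteger(watchdogPeriodSeconds) || watchdogPeriodSeconds < 60) {
    throw new RangeError("watchdog must be enabled with a period >= 60");
  }
  return 2 * watchdogPeriodSeconds;
}

// True if a process exit code indicates a watchdog-initiated shutdown.
function wasWatchdogExit(exitCode) {
  return exitCode === WATCHDOG_EXIT_CODE;
}
```

With the minimum period of 60, the approximate worst case is 120 seconds.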

IMPLEMENTATION DETAILS
The watchdog is implemented as a pair of threads in mongod that monitor the directories MongoDB uses to store data and log files. One thread checks the monitored directories, and a second thread verifies that the first thread is still making progress. The check thread runs at a fixed 10-second interval.
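
The two-thread arrangement can be pictured with a toy, single-threaded model. This is not the server's implementation: the class and method names are hypothetical, and it assumes the monitor detects a stuck check thread by comparing a generation counter between passes.

```javascript
// Toy model of the check thread / monitor thread relationship.
class WatchdogModel {
  constructor() {
    this.checkGeneration = 0;         // bumped after each directory check
    this.monitorGeneration = 0;       // bumped on each monitor pass
    this.lastSeenCheckGeneration = 0; // value observed by the previous pass
  }

  // The check "thread" finished one directory check.
  checkCompleted() {
    this.checkGeneration += 1;
  }

  // One monitor pass: "ok" if the check thread made progress since the
  // previous pass, otherwise "terminate".
  monitorPass() {
    this.monitorGeneration += 1;
    if (this.checkGeneration === this.lastSeenCheckGeneration) {
      return "terminate";
    }
    this.lastSeenCheckGeneration = this.checkGeneration;
    return "ok";
  }
}
```

In this model, if the monitor runs once per watchdogPeriodSeconds, a check thread that completes one check just after a monitor pass and then hangs is only caught on the second following pass, which is consistent with the roughly 2x detection bound given under OPERATION.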

DIAGNOSTICS
When enabled, the watchdog logs all changes to watchdogPeriodSeconds at the default log level.

When enabled at startup, the following message will appear in the logs:

CONTROL  [initandlisten] Starting Watchdog Monitor

If the watchdog is disabled or its period is changed at runtime, messages like the following will appear in the logs:

CONTROL  [initandlisten] WatchdogMonitor disabled
CONTROL  [initandlisten] WatchdogMonitor period changed to 120s

At log level 1, the watchdog logs its periodic disk checks:

CONTROL  [watchdogCheck] Watchdog test 'checked directory '/data/db/'' took 3ms
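
To track check latency over time, the level-1 message above can be parsed from the log. The following is a hedged sketch; the function name and the returned field names are hypothetical, and the pattern is based only on the single sample line shown here.

```javascript
// Extracts the test description and duration from a watchdog check line.
// Returns null for lines that do not match the sample format.
function parseWatchdogCheckLine(line) {
  const match = /Watchdog test '(.*)' took (\d+)ms/.exec(line);
  if (!match) return null;
  return { test: match[1], durationMillis: Number(match[2]) };
}
```

Applied to the sample line above, this yields a durationMillis of 3.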

If the watchdog was enabled at startup, an additional section named "watchdog" is added to the output of the serverStatus command.

MongoDB Enterprise > db.serverStatus()
        ...
        "watchdog" : {
                "checkGeneration" : NumberLong(2),
                "monitorGeneration" : NumberLong(0),
                "monitorPeriod" : 120
        },
        ...

The meaning of this data is:

  • checkGeneration: 64-bit signed integer; indicates the number of directory checks run since startup. It increments once for each directory checked. For example, if dbpath and logpath are specified, then this field is incremented twice every 10 seconds.
  • monitorGeneration: 64-bit signed integer; number of times the check thread has been checked for progress.
  • monitorPeriod: 64-bit signed integer; the value of the watchdogPeriodSeconds parameter.
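
The counters above can be used to confirm that checks are still running. This sketch works on plain JavaScript numbers; values returned by the shell as NumberLong would need converting first. The function names are hypothetical, and the expected-increment figure is only an approximation derived from the 10-second interval described in this ticket.

```javascript
// Fixed check interval, in seconds, as described under IMPLEMENTATION DETAILS.
const CHECK_INTERVAL_SECONDS = 10;

// Approximate number of checkGeneration increments expected over a window,
// given how many directories are monitored.
function expectedCheckIncrements(directoryCount, elapsedSeconds) {
  return directoryCount * Math.floor(elapsedSeconds / CHECK_INTERVAL_SECONDS);
}

// Compares the "watchdog" sections of two serverStatus snapshots and
// reports whether any directory check completed between them.
function checksAdvanced(earlier, later) {
  return later.checkGeneration > earlier.checkGeneration;
}
```

For example, with dbpath and logpath monitored, expectedCheckIncrements(2, 60) gives 12.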

TEST METHODOLOGY
We use CharybdeFS, a Linux FUSE file system from ScyllaDB, to create unresponsive I/O conditions and verify that the storage watchdog detects them.

Original description

Implement a storage node watchdog for Linux.



 Comments   
Comment by Githook User [ 13/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Fix compile on Enterprise OSX by using osx instead of darwin
Branch: v3.4
https://github.com/10gen/mongo-enterprise-modules/commit/d3bd790df1209c0bc4ebe950715b526b35ae6129

Comment by Githook User [ 13/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Fix compile on Enterprise OSX by using osx instead of darwin
Branch: v3.2
https://github.com/10gen/mongo-enterprise-modules/commit/ea3a9c12c07d0e56d3345bc753fd332e1e2b4dae

Comment by Githook User [ 13/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Implement Storage Node Watchdog

(cherry picked from commit 63379590ef6fad402b17464c8ca5ad4c09a626d3)
Branch: v3.2
https://github.com/10gen/mongo-enterprise-modules/commit/e597aee2136d8f539f7eea11e338dd19152770df

Comment by Githook User [ 13/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Implement Storage Node Watchdog

(cherry picked from commit eb333b92cae5e71affb0fe76cd388801afa8e79f)
Branch: v3.2
https://github.com/mongodb/mongo/commit/390d7e7290ccc51e99c902a8344ef2e0c60001cb

Comment by Githook User [ 13/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Implement Storage Node Watchdog

(cherry picked from commit 63379590ef6fad402b17464c8ca5ad4c09a626d3)
Branch: v3.4
https://github.com/10gen/mongo-enterprise-modules/commit/298869ac8df9186d599a1829e455ee559a3df45b

Comment by Githook User [ 13/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Implement Storage Node Watchdog

(cherry picked from commit eb333b92cae5e71affb0fe76cd388801afa8e79f)
Branch: v3.4
https://github.com/mongodb/mongo/commit/a889b0d79a17eeed1f548a227e13ad553d1b32a2

Comment by Githook User [ 12/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Implement Storage Node Watchdog
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/63379590ef6fad402b17464c8ca5ad4c09a626d3

Comment by Githook User [ 12/Jul/17 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-29947 Implement Storage Node Watchdog
Branch: master
https://github.com/mongodb/mongo/commit/eb333b92cae5e71affb0fe76cd388801afa8e79f
