Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.4.0-rc0, 4.7.0
Affects Version/s: None
Component/s: Logging
Labels:
None

Backwards Compatibility:
Fully Compatible
Backport Requested:

v4.4
Sprint:
Execution Team 2020-04-06, Dev Tools 2020-03-09
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Currently, any failure to append a log message is silently ignored, even though the appenders report failures. (Note that LogDomain's _abortOnFailure is initialised to false and is only set to true for the audit log.)

This is a problem because any given logfile might not be a complete record of the server's behaviour, which is an assumption that logfile analysis very frequently relies on. Also, in the event of a permanent logging failure (e.g. the logfile filesystem goes read-only), the server will continue to run, potentially for a long time, with no indication of any problem.

Possible solutions might include:

Abort the server if a log message append fails.
- Probably with a special exit code.
- Possibly after outputting the message somewhere else that might be able to accept writes, e.g.:
  - stderr,
  - a special file in the dbpath or /tmp,
  - syslog (if using logfile or console MongoDB logging).
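
As a rough illustration of the abort option above, here is a minimal Python sketch (not the server's actual C++ logging code). The names append_or_abort, LogAppendError and LOG_APPEND_FAILURE_EXIT_CODE, as well as the exit code value, are hypothetical and chosen only for this example. It tries the primary appender, falls back to other destinations on failure, and then exits with a dedicated code.

```python
import sys

# Hypothetical exit code reserved for "logging failed"; not a real MongoDB exit code.
LOG_APPEND_FAILURE_EXIT_CODE = 70


class LogAppendError(Exception):
    """Raised by an appender that could not write a message (hypothetical)."""


def append_or_abort(primary, fallbacks, message, exit_fn=sys.exit):
    """Append message via primary; on failure, notify one fallback, then exit.

    primary and each fallback are callables taking a single string.
    Returns True on success. On failure, calls exit_fn with the special code
    and returns False (only reachable if exit_fn does not actually exit).
    """
    try:
        primary(message)
        return True
    except (LogAppendError, OSError) as err:
        notice = f"log append failed ({err}); last message: {message}"
        for fallback in fallbacks:
            try:
                fallback(notice)
                break  # one successful fallback write is enough
            except (LogAppendError, OSError):
                continue  # this destination is also broken, try the next
        exit_fn(LOG_APPEND_FAILURE_EXIT_CODE)
        return False
```

A caller could pass something like `lambda m: print(m, file=sys.stderr)` as the first fallback, followed by a writer for a file in the dbpath or /tmp. Injecting exit_fn keeps the sketch testable without terminating the process.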

Increment a metric on log message append failure, and then expose it and/or act on it. For example:
- Expose it in serverStatus so that FTDC, the Cloud/Atlas monitoring agent, and free cloud monitoring (and possibly also the shell) can pick it up easily.
- An increment of this metric could trigger a replset primary stepdown.
- A non-zero metric value could make the replset member effectively priority: 0 (i.e. not eligible for election/votes).
- The metric could be exposed to other members in replset heartbeats, and the other nodes could then log if they observe this value becoming non-zero or increasing, or otherwise log periodically whilst it's non-zero.
- Adjust the storage watchdog to also monitor this metric, with repeated incrementing over several watchdog periods leading to server shutdown. (In an attempt to catch permanent failures, while tolerating small brief transient failures.)
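
The metric-plus-watchdog idea above could look roughly like the following Python sketch. This is an illustration only, not the server's storage watchdog; the class names and the max_consecutive_periods parameter are assumptions made for this example. The counter is incremented on each append failure, and the watchdog requests shutdown only when the counter has grown in several consecutive periods, so that a brief transient failure is tolerated.

```python
import threading


class LogAppendFailureCounter:
    """Thread-safe count of failed log appends (hypothetical metric)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._failures = 0

    def record_failure(self):
        with self._lock:
            self._failures += 1

    def value(self):
        with self._lock:
            return self._failures


class LogFailureWatchdog:
    """Requests shutdown after the counter grows in N consecutive periods."""

    def __init__(self, counter, max_consecutive_periods=3):
        self._counter = counter
        self._max = max_consecutive_periods
        self._last_seen = counter.value()
        self._consecutive = 0

    def check(self):
        """Call once per watchdog period; True means shutdown is warranted."""
        current = self._counter.value()
        if current > self._last_seen:
            self._consecutive += 1  # new failures during this period
        else:
            self._consecutive = 0  # a clean period resets the streak
        self._last_seen = current
        return self._consecutive >= self._max
```

The same counter value could also be what is reported in serverStatus or in heartbeats; resetting the streak on a clean period is one possible policy for telling permanent failures from transient ones.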

Assignee: Henrik Edin
Reporter: Kevin Pulo
Participants: Githook User, Henrik Edin, Kevin Pulo, Matt Lord
Votes: 1
Watchers: 12

Created: Nov 23 2018 02:10:59 AM UTC
Updated: Oct 29 2023 10:26:24 PM UTC
Resolved: Mar 24 2020 02:07:03 PM UTC
