[SERVER-38233] Logging should not silently ignore errors Created: 23/Nov/18  Updated: 29/Oct/23  Resolved: 24/Mar/20

Status: Closed
Project: Core Server
Component/s: Logging
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: Henrik Edin
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Execution Team 2020-04-06, Dev Tools 2020-03-09
Participants:

 Description   

Currently, any failure to append a log message is silently ignored (even though the appenders report failures). (Note that LogDomain's _abortOnFailure is initialised as false and only set to true for the audit log.)

This isn't great, because it means that any given logfile might not be a complete record of the server's behaviour, which is an assumption that logfile analysis very frequently relies on. Also, in the event of a permanent logging failure (e.g. the logfile filesystem goes read-only), the server will continue to run (potentially for a long period of time) with no indication of any problem.
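To make the behaviour described above concrete, here is a minimal Python sketch. It is not the server's C++ logging code: the class names, the boolean return convention, and the use of an exception in place of a process abort are all assumptions chosen for illustration. Only the idea of an abort-on-failure flag that defaults to false comes from this ticket.

```python
import io


class LogAppendError(RuntimeError):
    """Stands in for a process abort so the sketch stays testable (assumption)."""


class StreamAppender:
    """Hypothetical appender: writes to a stream and reports success or failure."""

    def __init__(self, stream):
        self.stream = stream

    def append(self, message):
        try:
            self.stream.write(message + "\n")
            return True
        except (OSError, ValueError):
            # e.g. a read-only filesystem or a closed stream
            return False


class FailingAppender:
    """Hypothetical appender that always reports failure."""

    def append(self, message):
        return False


class LogDomain:
    """Hypothetical log domain with an abort-on-failure flag, off by default."""

    def __init__(self, abort_on_failure=False):
        self.abort_on_failure = abort_on_failure
        self.appenders = []

    def append(self, message):
        ok = True
        for appender in self.appenders:
            if not appender.append(message):
                ok = False
                if self.abort_on_failure:
                    raise LogAppendError("failed to append log message")
        # With the flag off, the failure is only visible through this return
        # value; a caller that discards it loses the message silently.
        return ok
```

With the flag left at its default, a call such as `domain.append("msg")` whose return value is discarded drops the message without any trace, which is the situation this ticket describes.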

Possible solutions might include:

  • Abort the server if a log message append fails.
    • Probably with a special exit code.
    • Possibly after outputting the message somewhere else that might be able to accept writes, e.g.:
      • stderr,
      • a special file in the dbpath or /tmp,
      • to syslog (if using logfile or console MongoDB logging).
  • Increment a metric on log message append failure, and then expose it and/or act on it. For example:
    • Expose it in serverStatus so that FTDC, the Cloud/Atlas monitoring agent, and free cloud monitoring (and possibly also the shell) can pick it up easily.
    • Increment of this metric could trigger replset primary stepdown.
    • Non-zero metric value could make the replset member effectively priority: 0 (i.e. not eligible for election/votes).
    • The metric could be exposed to other members in replset heartbeats, and the other nodes could then log if they observe this value becoming non-zero, or increase, or otherwise log periodically whilst it's non-zero.
    • Adjust the storage watchdog to also monitor this metric, with repeated incrementing over several watchdog periods leading to server shutdown. (In an attempt to catch permanent failures, while tolerating small brief transient failures.)
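The metric-based options above, in particular the watchdog variant, could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the storage watchdog's actual design: the class names, the `threshold` parameter, and the rule "the counter grew in N consecutive periods" are all hypothetical.

```python
class LogFailureMetric:
    """Hypothetical counter incremented on every failed log message append."""

    def __init__(self):
        self.append_failures = 0

    def record_failure(self):
        self.append_failures += 1


class LogFailureWatchdog:
    """Hypothetical periodic check over the counter.

    Requests shutdown once the counter has increased in `threshold`
    consecutive periods, so brief transient failures are tolerated
    while a permanent failure is eventually caught.
    """

    def __init__(self, metric, threshold=3):
        self.metric = metric
        self.threshold = threshold
        self._last_seen = metric.append_failures
        self._consecutive = 0

    def check(self):
        """Run once per watchdog period; returns True if shutdown is warranted."""
        current = self.metric.append_failures
        if current > self._last_seen:
            self._consecutive += 1
        else:
            # A period with no new failures resets the streak.
            self._consecutive = 0
        self._last_seen = current
        return self._consecutive >= self.threshold
```

In this sketch, a single failed append followed by a quiet period never triggers shutdown, whereas failures observed in three consecutive periods do. The same counter value could be what serverStatus or replset heartbeats expose.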


 Comments   
Comment by Githook User [ 26/Mar/20 ]

Author: Henrik Edin (henrikedin) <henrik.edin@mongodb.com>

Message: SERVER-38233 Abort if we fail to write log to output stream

(cherry picked from commit e6e75a8bb7c95cca2a5f7ed028d497efbfe51078)
Branch: v4.4
https://github.com/mongodb/mongo/commit/d449f5ed8dadc77f90163fde0cbd103f0fbb4073

Comment by Githook User [ 24/Mar/20 ]

Author: Henrik Edin (henrikedin) <henrik.edin@mongodb.com>

Message: SERVER-38233 Abort if we fail to write log to output stream
Branch: master
https://github.com/mongodb/mongo/commit/e6e75a8bb7c95cca2a5f7ed028d497efbfe51078

Comment by Matt Lord (Inactive) [ 04/Dec/18 ]

I think that a new parameter called abortOnLogFailure, defaulting to off, would make sense.
