Core Server / SERVER-38233

Logging should not silently ignore errors

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.4.0-rc0, 4.7.0
    • Affects Version/s: None
    • Component/s: Logging
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v4.4
    • Sprint: Execution Team 2020-04-06, Dev Tools 2020-03-09

      Currently, any failure to append a log message is silently ignored (even though the appenders report failures). (Note that LogDomain's _abortOnFailure is initialised as false and only set to true for the audit log.)
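
      The behaviour described above can be illustrated with a minimal Python sketch. This is not the server's C++ code: the LogDomain class, its abort_on_failure flag, and failing_appender below are hypothetical stand-ins that only mirror the idea of an appender reporting a failure that the domain then drops unless abort-on-failure is enabled.

```python
class LogDomain:
    """Hypothetical stand-in for a log domain; not MongoDB's actual class."""

    def __init__(self, appenders, abort_on_failure=False):
        self.appenders = appenders
        # Mirrors the idea of _abortOnFailure: off by default.
        self.abort_on_failure = abort_on_failure

    def append(self, message):
        for appender in self.appenders:
            ok = appender(message)  # each appender reports success or failure
            if not ok and self.abort_on_failure:
                raise SystemExit(f"log append failed: {message!r}")
            # Otherwise the reported failure is dropped here, silently.


def failing_appender(message):
    """Simulates an appender whose target rejects writes (assumed for illustration)."""
    return False


general = LogDomain([failing_appender])
general.append("this message is lost")  # returns normally; nothing signals the loss
```

      With abort_on_failure=True (the audit log case in this sketch), the same call would raise SystemExit instead of returning.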

      This isn't great, because it means that any given logfile might not actually be a complete record of the server's behaviour, and logfile analysis very frequently assumes that it is. Also, in the event of a permanent logging failure (e.g. the logfile filesystem goes read-only), the server will continue to run (potentially for a long period of time) with no indication of any problem.

      Possible solutions might include:

      • Abort the server if a log message append fails.
        • Probably with a special exit code.
        • Possibly after writing the failed message (or a notice about it) somewhere else that might still accept writes, e.g.:
          • stderr,
          • a special file in the dbpath or /tmp,
          • syslog (if MongoDB is logging to a logfile or the console).
      • Increment a metric on log message append failure, and then expose it and/or act on it. For example:
        • Expose it in serverStatus so that FTDC, the Cloud/Atlas monitoring agent, and free cloud monitoring (and possibly also the shell) can pick it up easily.
        • An increment of this metric could trigger a replset primary stepdown.
        • A non-zero metric value could make the replset member effectively priority: 0 (i.e. not eligible for election/votes).
        • The metric could be exposed to other members in replset heartbeats, and the other nodes could then log if they observe this value becoming non-zero, or increase, or otherwise log periodically whilst it's non-zero.
        • Adjust the storage watchdog to also monitor this metric, with repeated incrementing over several watchdog periods leading to server shutdown. The aim is to catch permanent failures while tolerating small, brief transient failures.
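
      Two of the ideas listed above (counting append failures with a fallback write to stderr, and a watchdog that acts only on repeated increments) could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed implementation: LogFailureMetric, append_with_fallback, LogFailureWatchdog, and the threshold of 3 periods are all hypothetical names and values chosen for illustration.

```python
import sys


class LogFailureMetric:
    """Counts failed appends; hypothetical, e.g. a value one might expose in serverStatus."""

    def __init__(self):
        self.failures = 0


def append_with_fallback(appender, message, metric, fallback=sys.stderr):
    """Try the primary appender; on failure, count it and write to the fallback stream."""
    try:
        appender(message)
        return True
    except OSError as exc:
        metric.failures += 1
        print(f"log append failed ({exc}): {message}", file=fallback)
        return False


class LogFailureWatchdog:
    """Signals shutdown only if the metric grows in `threshold` consecutive periods."""

    def __init__(self, metric, threshold=3):
        self.metric = metric
        self.threshold = threshold  # assumed value, purely illustrative
        self.last_seen = metric.failures
        self.consecutive = 0

    def check(self):
        """Call once per watchdog period; returns True when shutdown is warranted."""
        current = self.metric.failures
        if current > self.last_seen:
            self.consecutive += 1  # failures continued during this period
        else:
            self.consecutive = 0   # a clean period resets the streak
        self.last_seen = current
        return self.consecutive >= self.threshold
```

      In this sketch a single failed append is counted and echoed to the fallback stream, but check() returns True only after the counter has grown in three consecutive periods, so a brief transient failure followed by a clean period does not lead to shutdown.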

            Assignee: Henrik Edin (henrik.edin@mongodb.com)
            Reporter: Kevin Pulo (kevin.pulo@mongodb.com)
            Votes: 1
            Watchers: 12
