Core Server / SERVER-30961

FTDC should gracefully handle disk full

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Diagnostics
    • Labels:
    • Server Security

      Description

      Currently, if the dbpath filesystem fills up (even briefly, for a few seconds), FTDC will shut down.

      2017-09-06T03:15:53.397+0000 W FTDC     [ftdc] Uncaught exception in 'FileStreamFailed: Failed to write to interim file buffer for full-time diagnostic data capture: /home/kev/data/db/diagnostic.data/metrics.interim.temp' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
      

      This is because the metrics.interim file is appended to gradually, so each write needs free disk space at the moment it happens.

      $ ls -la diagnostic.data/
      total 52
      drwxrwxr-x 2 kev kev  4096 Sep  6 03:21 ./
      drwxr-xr-x 5 kev kev  4096 Sep  6 03:21 ../
      -rw-rw-r-- 1 kev kev  8257 Sep  6 03:20 metrics.interim
      -rw-rw-r-- 1 kev kev     0 Sep  6 03:21 metrics.interim.temp
      -rw-rw-r-- 1 kev kev 10953 Sep  6 03:12 metrics.2017-09-06T03-11-16Z-00000
      -rw-rw-r-- 1 kev kev  2767 Sep  6 03:15 metrics.2017-09-06T03-15-01Z-00000
      -rw-rw-r-- 1 kev kev 10395 Sep  6 03:18 metrics.2017-09-06T03-18-53Z-00000
      -rw-rw-r-- 1 kev kev  2767 Sep  6 03:19 metrics.2017-09-06T03-19-47Z-00000
      

      Furthermore, in some cases when the mongod is restarted after this, FTDC will actually shut down again, even if disk space is now available (apparently due to some problem with the leftover metrics.interim.temp file):

      2017-09-06T03:18:54.003+0000 I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
      2017-09-06T03:18:54.014+0000 W FTDC     [ftdc] Uncaught exception in 'UnknownError: Caught std::exception of type boost::filesystem::filesystem_error: boost::filesystem::file_size: No such file or directory: "/home/kev/data/db/diagnostic.data/metrics.interim.temp"' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
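
      The second log line shows a file-size lookup throwing because the path does not exist. As a minimal sketch of the defensive pattern (written in Python for brevity, not the server's C++; the function name interim_temp_size is hypothetical and this is not the actual fix), a missing leftover file can be treated as empty rather than as a fatal error:

```python
import os


def interim_temp_size(path: str) -> int:
    """Return the size of a leftover temp file, or 0 if it does not exist.

    Hypothetical helper: a missing file is treated as "nothing to recover"
    instead of an error that would shut the subsystem down.
    """
    try:
        return os.path.getsize(path)
    except FileNotFoundError:
        # The temp file vanished or was never created; carry on.
        return 0
```

      Whether ignoring the missing file is the right recovery policy is a design decision for the server; the sketch only illustrates that the lookup itself need not be fatal.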
      

      The main problem is that FTDC is unable to record details immediately before (and after) the disk has completely filled. A secondary problem is that FTDC cannot be restarted without restarting the mongod (not even by using setParameter to set diagnosticDataCollectionEnabled to false and then back to true). Both problems could hinder diagnosis of the disk-full event and of subsequent events.

      Suggested fix

      It would be better if FTDC preallocated the metrics.interim file at its maximum size (diagnosticDataCollectionFileSizeMB) and then wrote within that reserved space. When the file is renamed, it can be truncated if necessary (ideally after the next interim file has been preallocated).
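
      The suggestion can be sketched roughly as follows. This is an illustration in Python rather than the server's C++, and it assumes a POSIX system where os.posix_fallocate is available. The helper names (preallocate, write_sample, finalize, demo) and the file names metrics.interim.next and metrics.rotated are hypothetical, chosen only for the example:

```python
import os


def preallocate(path: str, size_bytes: int) -> int:
    """Reserve size_bytes of real disk space for path; return an open fd."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # A full disk fails here, up front, instead of on a later write.
        os.posix_fallocate(fd, 0, size_bytes)
    except OSError:
        os.close(fd)
        raise
    return fd


def write_sample(fd: int, offset: int, sample: bytes) -> int:
    """Write sample inside the reserved region; return the new offset."""
    view = memoryview(sample)
    while view:
        written = os.pwrite(fd, view, offset)
        offset += written
        view = view[written:]
    return offset


def finalize(fd: int, used_bytes: int, interim_path: str, final_path: str) -> None:
    """Rename the file, then trim the unused reserved tail."""
    os.rename(interim_path, final_path)
    os.ftruncate(fd, used_bytes)  # fd still refers to the renamed file
    os.fsync(fd)
    os.close(fd)


def demo(directory: str, size_bytes: int = 1024 * 1024):
    interim = os.path.join(directory, "metrics.interim")
    fd = preallocate(interim, size_bytes)
    used = write_sample(fd, 0, b"sample-1")
    used = write_sample(fd, used, b"sample-2")
    # Reserve the next file BEFORE releasing the current file's spare space.
    next_path = os.path.join(directory, "metrics.interim.next")
    next_fd = preallocate(next_path, size_bytes)
    final = os.path.join(directory, "metrics.rotated")
    finalize(fd, used, interim, final)
    os.close(next_fd)
    return os.path.getsize(final), os.path.getsize(next_path)
```

      Preallocating the next file before truncating the current one means the reserved space is never handed back to the filesystem in between, which is the ordering the suggestion describes as ideal.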

      Repro steps

      mlaunch init --single
      mlaunch stop
      fallocate -l 2147483648 blob    # 2 GiB backing file for a small loopback filesystem
      mkfs.ext4 blob
      mv data/db data/db.orig
      mkdir data/db
      sudo mount -o loop blob data/db
      sudo chown `whoami` data/db
      cp -a data/db.orig/* data/db/
      mlaunch start
      tail -f data/mongodb.log    # in a separate window
      dd if=/dev/zero of=data/db/filler    # fills the filesystem; dd exits with an error when the disk is full
      sleep 10    # wait up to diagnosticDataCollectionSamplesPerInterimUpdate * diagnosticDataCollectionPeriodMillis (10 secs by default) for FTDC to shut down
      rm data/db/filler    # avoid WT fassert during checkpoint
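
      The wait in the sleep step is the product of the two parameters named in its comment. A small sketch of that arithmetic, assuming the defaults are 10 samples per interim update and a 1000 ms collection period (consistent with the "10 secs by default" noted above; the function name is hypothetical):

```python
def interim_update_interval_secs(samples_per_interim_update: int = 10,
                                 period_millis: int = 1000) -> float:
    """Upper bound, in seconds, between two writes of the interim file.

    The defaults are assumed values for
    diagnosticDataCollectionSamplesPerInterimUpdate and
    diagnosticDataCollectionPeriodMillis.
    """
    return samples_per_interim_update * period_millis / 1000.0
```

      If either parameter has been changed on the test instance, the sleep duration in the repro should be adjusted to match.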
      

            Assignee: backlog-server-security [DO NOT USE] Backlog - Security Team
            Reporter: kevin.pulo@mongodb.com Kevin Pulo
            Votes: 3
            Watchers: 11
