- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Diagnostics
Server Security
Description
Currently, if the dbpath filesystem fills up (even briefly, for a few seconds), FTDC shuts down:
2017-09-06T03:15:53.397+0000 W FTDC [ftdc] Uncaught exception in 'FileStreamFailed: Failed to write to interim file buffer for full-time diagnostic data capture: /home/kev/data/db/diagnostic.data/metrics.interim.temp' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
This happens because the metrics.interim file is appended to gradually, so each write needs new disk space:
$ ls -la diagnostic.data/
total 52
drwxrwxr-x 2 kev kev 4096 Sep 6 03:21 ./
drwxr-xr-x 5 kev kev 4096 Sep 6 03:21 ../
-rw-rw-r-- 1 kev kev 8257 Sep 6 03:20 metrics.interim
-rw-rw-r-- 1 kev kev 0 Sep 6 03:21 metrics.interim.temp
-rw-rw-r-- 1 kev kev 10953 Sep 6 03:12 metrics.2017-09-06T03-11-16Z-00000
-rw-rw-r-- 1 kev kev 2767 Sep 6 03:15 metrics.2017-09-06T03-15-01Z-00000
-rw-rw-r-- 1 kev kev 10395 Sep 6 03:18 metrics.2017-09-06T03-18-53Z-00000
-rw-rw-r-- 1 kev kev 2767 Sep 6 03:19 metrics.2017-09-06T03-19-47Z-00000
Furthermore, in some cases when the mongod is restarted after this, FTDC shuts down again even though disk space is now available (apparently due to some problem with the leftover metrics.interim.temp file):
2017-09-06T03:18:54.003+0000 I FTDC [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
2017-09-06T03:18:54.014+0000 W FTDC [ftdc] Uncaught exception in 'UnknownError: Caught std::exception of type boost::filesystem::filesystem_error: boost::filesystem::file_size: No such file or directory: "/home/kev/data/db/diagnostic.data/metrics.interim.temp"' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
The main problem is that FTDC cannot record details immediately before (and after) the disk fills completely. A secondary problem is that FTDC cannot be restarted without restarting the mongod (not even by using setParameter to set diagnosticDataCollectionEnabled to false and then back to true). Either could hurt diagnosability of the disk-full event or of subsequent events.
Suggested fix
It would be better if FTDC preallocated space for the metrics.interim file (of size diagnosticDataCollectionFileSizeMB) and then wrote inside that space. When the file is renamed, it can be truncated if necessary (ideally after the next interim file has been preallocated).
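To illustrate the preallocate-then-write-inside idea, here is a minimal Python sketch. It is not MongoDB's implementation: the class `PreallocatedInterimFile` and its methods are hypothetical, and it assumes a Linux/POSIX system where `os.posix_fallocate` is available and the filesystem honours the reservation.

```python
import os


class PreallocatedInterimFile:
    """Hypothetical stand-in for an FTDC interim file (illustration only)."""

    def __init__(self, path: str, size_bytes: int) -> None:
        self.path = path
        self.capacity = size_bytes
        self.used = 0
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # Reserve the blocks up front: if the disk is full, this fails
            # here rather than later in the middle of a metrics write.
            os.posix_fallocate(self.fd, 0, size_bytes)
        except OSError:
            os.close(self.fd)
            raise

    def append(self, chunk: bytes) -> None:
        if self.used + len(chunk) > self.capacity:
            raise ValueError("chunk does not fit in the preallocated space")
        # Write inside the already-reserved region; loop in case of
        # partial writes.
        view = memoryview(chunk)
        offset = self.used
        while view:
            written = os.pwrite(self.fd, view, offset)
            view = view[written:]
            offset += written
        self.used = offset

    def finalize(self, final_path: str) -> None:
        # Rename first, then trim the unused tail of the reservation.
        os.rename(self.path, final_path)
        os.ftruncate(self.fd, self.used)
        os.close(self.fd)
```

In this sketch `finalize` truncates immediately for brevity. To follow the ordering suggested above, a caller would preallocate the next interim file before truncating the renamed one, so the freed space cannot be consumed by another writer in between.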
Repro steps
mlaunch init --single
mlaunch stop
fallocate -l 2147483648 blob
mkfs.ext4 blob
mv data/db data/db.orig
mkdir db
sudo mount -o loop blob db
sudo chown `whoami` db
cp -a db.orig/* db/
mlaunch start
tailf data/mongodb.log # in a separate window
dd if=/dev/zero of=db/filler
sleep 10 # wait up to diagnosticDataCollectionSamplesPerInterimUpdate * diagnosticDataCollectionPeriodMillis (10 secs by default) for FTDC to shutdown
rm filler # avoid WT fassert during checkpoint
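The wait time in the sleep step is the product of two server parameters. A small sketch of the arithmetic follows; the individual values of 10 samples and 1000 ms are assumed defaults (the steps above state only their product, 10 seconds), and `max_wait_seconds` is a hypothetical helper.

```python
# Assumed defaults; the repro steps state only that the product is 10 seconds.
SAMPLES_PER_INTERIM_UPDATE = 10  # diagnosticDataCollectionSamplesPerInterimUpdate
PERIOD_MILLIS = 1000             # diagnosticDataCollectionPeriodMillis


def max_wait_seconds(samples_per_update: int, period_millis: int) -> float:
    """Upper bound on the time until FTDC next writes the interim file."""
    return samples_per_update * period_millis / 1000.0


print(max_wait_seconds(SAMPLES_PER_INTERIM_UPDATE, PERIOD_MILLIS))  # 10.0
```

If either parameter has been changed on the test instance, the sleep duration in the repro steps should be adjusted accordingly.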