[SERVER-30961] FTDC should gracefully handle disk full Created: 06/Sep/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: Backlog - Security Team
Resolution: Unresolved Votes: 3
Labels: SWDI, move-sec
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Server Security
Participants:

 Description   

Currently, if the dbpath filesystem fills up (even briefly, for a few seconds), FTDC shuts down.

2017-09-06T03:15:53.397+0000 W FTDC     [ftdc] Uncaught exception in 'FileStreamFailed: Failed to write to interim file buffer for full-time diagnostic data capture: /home/kev/data/db/diagnostic.data/metrics.interim.temp' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.

This is because the metrics.interim file is appended to gradually, so each write needs free disk space at the moment it is made.

$ ls -la diagnostic.data/
total 52
drwxrwxr-x 2 kev kev  4096 Sep  6 03:21 ./
drwxr-xr-x 5 kev kev  4096 Sep  6 03:21 ../
-rw-rw-r-- 1 kev kev  8257 Sep  6 03:20 metrics.interim
-rw-rw-r-- 1 kev kev     0 Sep  6 03:21 metrics.interim.temp
-rw-rw-r-- 1 kev kev 10953 Sep  6 03:12 metrics.2017-09-06T03-11-16Z-00000
-rw-rw-r-- 1 kev kev  2767 Sep  6 03:15 metrics.2017-09-06T03-15-01Z-00000
-rw-rw-r-- 1 kev kev 10395 Sep  6 03:18 metrics.2017-09-06T03-18-53Z-00000
-rw-rw-r-- 1 kev kev  2767 Sep  6 03:19 metrics.2017-09-06T03-19-47Z-00000

Furthermore, in some cases when mongod is restarted after this, FTDC shuts down again even if disk space is now available (apparently due to some problem with the leftover metrics.interim.temp file):

2017-09-06T03:18:54.003+0000 I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
2017-09-06T03:18:54.014+0000 W FTDC     [ftdc] Uncaught exception in 'UnknownError: Caught std::exception of type boost::filesystem::filesystem_error: boost::filesystem::file_size: No such file or directory: "/home/kev/data/db/diagnostic.data/metrics.interim.temp"' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.

The main problem is that FTDC cannot record details immediately before (and after) the disk fills completely. A secondary problem is that FTDC cannot be restarted without restarting mongod (not even by using setParameter to set diagnosticDataCollectionEnabled to false and then back to true). Both could hamper diagnosis of the disk-filling event, or of subsequent events.

Suggested fix

It would be better if FTDC preallocated space for the metrics.interim file (diagnosticDataCollectionFileSizeMB in size) and then wrote inside that space. When the file is renamed, it can be truncated if necessary (ideally after the next interim file has been preallocated).
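
To make the idea concrete, here is a minimal sketch of the preallocate / write-in-place / truncate-on-rename sequence. It is written in Python purely for illustration and is not FTDC's actual (C++) code; the function names preallocate, write_sample and finalize are hypothetical, and it assumes a Linux-like system where os.posix_fallocate is available.

```python
import os
import tempfile


def preallocate(path: str, size: int) -> int:
    """Create `path` and reserve `size` bytes of disk blocks up front (hypothetical helper)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Unlike ftruncate, posix_fallocate allocates the blocks now, so a
        # full disk is reported here rather than on a later write.
        os.posix_fallocate(fd, 0, size)
    except OSError:
        os.close(fd)
        os.unlink(path)
        raise
    return fd


def write_sample(fd: int, offset: int, payload: bytes, capacity: int) -> int:
    """Write `payload` at `offset` inside the reserved region; return the new offset."""
    if offset + len(payload) > capacity:
        raise ValueError("sample does not fit in the preallocated file")
    view = memoryview(payload)
    while view:
        # pwrite may write fewer bytes than requested, so loop until done.
        written = os.pwrite(fd, view, offset)
        offset += written
        view = view[written:]
    return offset


def finalize(fd: int, used: int, interim_path: str, final_path: str) -> None:
    """Trim the unused preallocated tail, then rename the file into place."""
    os.ftruncate(fd, used)
    os.fsync(fd)
    os.close(fd)
    os.replace(interim_path, final_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        interim = os.path.join(d, "metrics.interim")
        capacity = 1024 * 1024  # stand-in for diagnosticDataCollectionFileSizeMB
        fd = preallocate(interim, capacity)
        end = write_sample(fd, 0, b"sample-1", capacity)
        end = write_sample(fd, end, b"sample-2", capacity)
        finalize(fd, end, interim, os.path.join(d, "metrics.archive"))
```

With this shape, the only step that can fail for lack of space is the preallocation itself; writes inside the reserved region, and the truncate and rename that follow, do not need additional free space. To follow the ordering suggested above, the next interim file would be preallocated before finalize is called on the current one.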

Repro steps

mlaunch init --single
mlaunch stop
fallocate -l 2147483648 blob    # 2 GiB backing file for a small loopback filesystem
mkfs.ext4 blob
mv data/db data/db.orig
mkdir data/db
sudo mount -o loop blob data/db
sudo chown `whoami` data/db
cp -a data/db.orig/* data/db/
mlaunch start
tail -f data/mongodb.log    # in a separate window
dd if=/dev/zero of=data/db/filler    # fill the filesystem; dd stops when the disk is full
sleep 10    # wait up to diagnosticDataCollectionSamplesPerInterimUpdate * diagnosticDataCollectionPeriodMillis (10 secs by default) for FTDC to shut down
rm data/db/filler    # avoid WT fassert during checkpoint



 Comments   
Comment by Ruslan Zarifov [ 11/Nov/19 ]

Hey guys. Not sure if this is related, but my MongoDB server (version 4.2.0) failed recently because of:

Uncaught exception in 'FileStreamFailed: Failed to write to archive file buffer for root@JenkinsCI:/var/log/mongodb

I couldn't find any information about this issue, but I found this page, so I'll leave a comment here.
If I can help in any way, feel free to ask.

And by the way, my disk wasn't full when it happened. It still had 16 GB of free space.

Generated at Thu Feb 08 04:25:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.