[SERVER-13811] Deal better/Fail more gracefully when mongoD runs out of disk space
Created: 01/May/14
Updated: 13/Dec/22

Status: Backlog
Project: Core Server
Component/s: Stability, Storage
Affects Version/s: None
Fix Version/s: None

Type: New Feature
Priority: Major - P3
Reporter: Osmar Olivo
Assignee: Brian Lane
Resolution: Unresolved
Votes: 8
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-15952 mongod hits assertion when run out of... Closed
is duplicated by SERVER-15959 Running out of disk space should not ... Closed
Related
is related to SERVER-3759 filesystem ops may cause termination ... Closed

 Description   

Currently, when a mongoD process runs out of disk space and fails to preallocate a file or write to the journal, it responds by terminating the server process.

This is a difficult state to recover from, because a remove operation issued to reclaim space will itself fail. Furthermore, operations that write to disk temporarily, such as external sorts or temporary aggregation results, will also fail.

A more graceful approach would be to let us limit mongoD's disk usage to some threshold short of filling the disk, so that the system can still be cleaned up and stabilized.

Something like "Stop accepting writes (other than removes) if less than 10% (or some number of GB) of disk space is available" or "If preallocation fails due to lack of space (2GB) for the final datafile, stop accepting writes aside from removes" would be much more graceful. This would of course mean that $out and external sorts should fail as well, but it would save us from dealing with all the other issues associated with a full disk.
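The first rule above can be sketched roughly as follows. This is only an illustration of the proposed policy, not mongoD code or an existing server option: the `DiskGuard` class, its parameter names, the operation labels, and the `allows_on_path` helper are all hypothetical, and treating the percentage and the absolute floor as "either one trips the guard" is an assumption about how the two limits would combine.

```python
import shutil
from dataclasses import dataclass

GB = 1024 ** 3

# Hypothetical operation labels for this sketch; not mongoD command names.
ALWAYS_ALLOWED = {"read", "remove"}


@dataclass
class DiskGuard:
    """Illustrative policy: refuse space-consuming writes when free space is low."""

    min_free_fraction: float = 0.10  # the "10%" figure from the description
    min_free_bytes: int = 0          # optional absolute floor ("some number of GB")

    def low_on_space(self, total: int, free: int) -> bool:
        # Assumption: either limit being crossed counts as "low".
        if free < self.min_free_bytes:
            return True
        return free < total * self.min_free_fraction

    def allows(self, op: str, total: int, free: int) -> bool:
        # Reads and removes stay available so that cleanup remains possible.
        if op in ALWAYS_ALLOWED:
            return True
        return not self.low_on_space(total, free)


def allows_on_path(guard: DiskGuard, op: str, path: str) -> bool:
    """Check an operation against the real free space of the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return guard.allows(op, usage.total, usage.free)
```

With `DiskGuard(min_free_fraction=0.10, min_free_bytes=2 * GB)`, an insert on a 100 GB volume with 5 GB free would be refused while a remove would still be accepted. The point of refusing writes before the disk is actually full is that the remaining headroom is what lets the remove itself succeed.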

Of course, there are edge cases to consider. For example, if a secondary hits this threshold, it can no longer replicate, so it should be marked as down or unavailable with respect to the quorum (which I believe already happens). But then how do we process cleanup if it can't replicate the removes? We would just have to increase capacity or do a full resync in situations where a secondary runs out of disk before the primary.

But for the general case, this would be a huge win, whether the number is configurable or not.



 Comments   
Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.

Comment by Kyle Mertz [ 24/Sep/18 ]

I agree with this. The system shouldn't stop working if you run out of space. Writes should obviously fail, but reads and deletes should still be accepted. Causing an outage when one isn't necessary is bad design. Think of how Oracle handles its archiver error when the disk fills up: it still allows read and delete operations, but inserts and updates fail.

Comment by Geert Bosch [ 01/Dec/15 ]

Example backtrace from running out of disk space with WiredTiger as the storage engine.

 WT_CURSOR.insert: index-29--7668984530888323570.wt write error: failed to write 8192 bytes at offset 7509319680: No space left on device
2015-12-01T01:48:40.653-0500 I -        [conn26] Fatal Assertion 28559
2015-12-01T01:48:40.653-0500 I -        [conn26]
 
***aborting after fassert() failure
 
 
2015-12-01T01:48:40.660-0500 F -        [conn30] Got signal: 6 (Aborted).
 
 0x12ea5c2 0x12e9719 0x12e9f32 0x7f89fc998340 0x7f89fc5f6bb9 0x7f89fc5f9fc8 0x1274c92 0x1072ef5 0x106d42c 0x1069b1c 0x10527d0 0xc8982f 0xae1549 0xae57df 0xae5880 0xac51f5 0xac5501 0xca3fb8 0xca4369 0xca4454 0xca99f6 0xcabf25 0x9966cc 0x1298145 0x7f89fc990182 0x7f89fc6bafbd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EEA5C2"},{"b":"400000","o":"EE9719"},{"b":"400000","o":"EE9F32"},{"b":"7F89FC988000","o":"10340"},{"b":"7F89FC5C0000","o":"36BB9"},{"b":"7F89FC5C0000","o":"39FC8"},{"b":"400000","o":"E74C92"},{"b":"400000","o":"C72EF5"},{"b":"400000","o":"C6D42C"},{"b":"400000","o":"C69B1C"},{"b":"400000","o":"C527D0"},{"b":"400000","o":"88982F"},{"b":"400000","o":"6E1549"},{"b":"400000","o":"6E57DF"},{"b":"400000","o":"6E5880"},{"b":"400000","o":"6C51F5"},{"b":"400000","o":"6C5501"},{"b":"400000","o":"8A3FB8"},{"b":"400000","o":"8A4369"},{"b":"400000","o":"8A4454"},{"b":"400000","o":"8A99F6"},{"b":"400000","o":"8ABF25"},{"b":"400000","o":"5966CC"},{"b":"400000","o":"E98145"},{"b":"7F89FC988000","o":"8182"},{"b":"7F89FC5C0000","o":"FAFBD"}],"processInfo":{ "mongodbVersion" : "3.2.0-rc4-16-g764f8a3", "gitVersion" : "764f8a33034392758d033f449d480121d3bb32e1", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-39-generic", "version" : "#66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "FE28A46DBAB02AE578EDCCF59C9B4A9C75243BC7" }, { "b" : "7FFF99CE0000", "elfType" : 3, "buildId" : "0074678E5FFFF79F46C476077E67057161772F37" }, { "b" : "7F89FDBC8000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "24273411CD5FDB1E42F868F4E53513A26C404DBB" }, { "b" : "7F89FD7E8000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "98690042D55F842BD5D326A2A7234CB59FFEE78D" }, { "b" : "7F89FD5E0000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F89FD3D8000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F89FD0C8000", "path" : "/usr/local/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F89FCDC0000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", 
"elfType" : 3, "buildId" : "574C6350381DA194C00FF555E0C1784618C05569" }, { "b" : "7F89FCBA8000", "path" : "/usr/local/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F89FC988000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "FE662C4D7B14EE804E0C1902FB55218A106BC5CB" }, { "b" : "7F89FC5C0000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B515361E474796AF29DE9992B76A97CFFB39B2A7" }, { "b" : "7F89FDE28000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12ea5c2]
 mongod(+0xEE9719) [0x12e9719]
 mongod(+0xEE9F32) [0x12e9f32]
 libpthread.so.0(+0x10340) [0x7f89fc998340]
 libc.so.6(gsignal+0x39) [0x7f89fc5f6bb9]
 libc.so.6(abort+0x148) [0x7f89fc5f9fc8]
 mongod(_ZN5mongo13fassertFailedEi+0x82) [0x1274c92]
 mongod(_ZN5mongo17wtRCToStatus_slowEiPKc+0x365) [0x1072ef5]
 mongod(_ZN5mongo17WiredTigerSession13releaseCursorEmP11__wt_cursor+0x12C) [0x106d42c]
 mongod(_ZN5mongo16WiredTigerCursorD1Ev+0x1C) [0x1069b1c]
 mongod(_ZN5mongo15WiredTigerIndex6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdEb+0xD0) [0x10527d0]
 mongod(_ZN5mongo17IndexAccessMethod6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdERKNS_19InsertDeleteOptionsEPl+0x18F) [0xc8982f]
 mongod(_ZN5mongo12IndexCatalog21_indexFilteredRecordsEPNS_16OperationContextEPNS_17IndexCatalogEntryERKSt6vectorINS_10BsonRecordESaIS6_EE+0x109) [0xae1549]
 mongod(_ZN5mongo12IndexCatalog13_indexRecordsEPNS_16OperationContextEPNS_17IndexCatalogEntryERKSt6vectorINS_10BsonRecordESaIS6_EE+0x11F) [0xae57df]
 mongod(_ZN5mongo12IndexCatalog12indexRecordsEPNS_16OperationContextERKSt6vectorINS_10BsonRecordESaIS4_EE+0x80) [0xae5880]
 mongod(_ZN5mongo10Collection16_insertDocumentsEPNS_16OperationContextEN9__gnu_cxx17__normal_iteratorIPKNS_7BSONObjESt6vectorIS5_SaIS5_EEEESB_b+0x325) [0xac51f5]
 mongod(_ZN5mongo10Collection15insertDocumentsEPNS_16OperationContextEN9__gnu_cxx17__normal_iteratorIPKNS_7BSONObjESt6vectorIS5_SaIS5_EEEESB_bb+0x1B1) [0xac5501]
 mongod(_ZN5mongo17insertMultiVectorEPNS_16OperationContextERNS_16OldClientContextEbPKcRNS_5CurOpEN9__gnu_cxx17__normal_iteratorIPNS_7BSONObjESt6vectorISA_SaISA_EEEESF_+0x138) [0xca3fb8]
 mongod(_ZN5mongo11insertMultiEPNS_16OperationContextERNS_16OldClientContextEbPKcRSt6vectorINS_7BSONObjESaIS7_EERNS_5CurOpE+0xF9) [0xca4369]
 mongod(_ZN5mongo15_receivedInsertEPNS_16OperationContextERKNS_15NamespaceStringEPKcRSt6vectorINS_7BSONObjESaIS8_EEbRNS_5CurOpEb+0xD4) [0xca4454]
 mongod(_ZN5mongo14receivedInsertEPNS_16OperationContextERKNS_15NamespaceStringERNS_7MessageERNS_5CurOpE+0x376) [0xca99f6]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x11C5) [0xcabf25]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0xEC) [0x9966cc]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x285) [0x1298145]
 libpthread.so.0(+0x8182) [0x7f89fc990182]
 libc.so.6(clone+0x6D) [0x7f89fc6bafbd]
-----  END BACKTRACE  -----
Aborted (core dumped)

Generated at Thu Feb 08 03:32:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.