[SERVER-4038] mongod crashes preallocating next file when there's no more disk space available Created: 07/Oct/11  Updated: 25/Jun/15  Resolved: 24/Mar/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Pasette (Inactive) Assignee: Mathias Stearn
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-3759 filesystem ops may cause termination ... Closed
Operating System: ALL
Participants:

 Description   

I just got off a sales call with a prospect who is evaluating MongoDB.

They claim that mongod crashes when it runs out of disk space. They
are using XFS. From their description, it sounds like this is
happening when it tries to preallocate the next file to use and there
isn't enough space. I tried a quick search in Jira, but didn't see
anything obvious. Has anyone seen anything like this before, or is
this a known problem?

Dwight sez: "i think that is correct. we have not done any work on out of disk space really."



 Comments   
Comment by Mathias Stearn [ 24/Mar/14 ]

This is intentional. The specific error occurs when writing to the journal. If there is no room to write the journal (or writing fails for any reason) we explicitly abort the server. The rationale is that it is better to crash and leave the data in a consistent state than to potentially corrupt the data.
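
A minimal sketch of the fail-stop behaviour described above, in Python rather than the server's C++. The function name is borrowed loosely from the `LogFile::synchronousAppend` frame in the backtrace quoted in this ticket, but the body is an illustration only, not the server's implementation; the `on_failure` parameter is a hypothetical hook added so the behaviour can be exercised without killing the process.

```python
import os


def synchronous_append(fd, data, on_failure=os.abort):
    """Append data to fd and fsync it; call on_failure if anything goes wrong.

    on_failure defaults to os.abort, which ends the process immediately
    rather than continuing with a journal that may be incomplete.
    """
    try:
        written = os.write(fd, data)
        if written != len(data):
            # A short write means the journal entry is incomplete.
            raise OSError("short write to journal")
        os.fsync(fd)
    except OSError:
        on_failure()
```

Any `OSError`, including `ENOSPC` (errno 28, as in the log in this ticket), takes the same path: nothing is retried and the process stops.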

Comment by Stennie Steneker (Inactive) [ 12/Oct/12 ]

There was a report of an ungraceful crash in 2.0.7 with stacktrace included from:
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/PVEIn3nc8g4

Thu Sep 13 11:18:13 [FileAllocator] error failed to allocate new file: /mysqlroot/mongodb/community/community.5 size: 2146435072 errno:28 No space left on device
Thu Sep 13 11:18:13 [FileAllocator]     will try again in 10 seconds
Thu Sep 13 11:18:13 Backtrace:
0xa9609a 0x3ac52302f0 0x3ac5230285 0x3ac5231d30 0x75befd 0x777b65 0x777d9e 0x762cdc 0x76352d 0x76384d 0x76412b 0xaabdb0 0x3ac5e0677d 0x3ac52d325d 
 /usr/local/bin/mongod(_ZN5mongo10abruptQuitEi+0x3aa) [0xa9609a]
 /lib64/libc.so.6 [0x3ac52302f0]
 /lib64/libc.so.6(gsignal+0x35) [0x3ac5230285]
 /lib64/libc.so.6(abort+0x110) [0x3ac5231d30]
 /usr/local/bin/mongod(_ZN5mongo7LogFile17synchronousAppendEPKvm+0x12d) [0x75befd]
 /usr/local/bin/mongod(_ZN5mongo3dur7Journal7journalERKNS0_11JSectHeaderERKNS_14AlignedBuilderE+0x1e5) [0x777b65]
 /usr/local/bin/mongod(_ZN5mongo3dur14WRITETOJOURNALENS0_11JSectHeaderERNS_14AlignedBuilderE+0x4e) [0x777d9e]
 /usr/local/bin/mongod(_ZN5mongo3dur28_groupCommitWithLimitedLocksEv+0x24c) [0x762cdc]
 /usr/local/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0x1d) [0x76352d]
 /usr/local/bin/mongod [0x76384d]
 /usr/local/bin/mongod(_ZN5mongo3dur9durThreadEv+0x10b) [0x76412b]
 /usr/local/bin/mongod(thread_proxy+0x80) [0xaabdb0]
 /lib64/libpthread.so.0 [0x3ac5e0677d]
 /lib64/libc.so.6(clone+0x6d) [0x3ac52d325d]
 
Logstream::get called in uninitialized state
Thu Sep 13 11:18:13 Invalid access at address: 0x4
 
Thu Sep 13 11:18:13 Got signal: 11 (Segmentation fault).
 
Thu Sep 13 11:18:13 Backtrace:
0xa9609a 0xa9678c 0x3ac5e0ebe0 0x54f952 0x55a864 0x55b554 0x8e07e5 0x8e13c8 0x96026e 0x96501d 0x88ce24 0x88e7cf 0xaa0b38 0x638767 0x3ac5e0677d 0x3ac52d325d 
 /usr/local/bin/mongod(_ZN5mongo10abruptQuitEi+0x3aa) [0xa9609a]
 /usr/local/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x22c) [0xa9678c]
 /lib64/libpthread.so.0 [0x3ac5e0ebe0]
 /usr/local/bin/mongod(_ZN5mongo10FieldRangeC1ERKNS_11BSONElementEbbb+0x1a2) [0x54f952]
 /usr/local/bin/mongod(_ZN5mongo13FieldRangeSet17processQueryFieldERKNS_11BSONElementEb+0x84) [0x55a864]
 /usr/local/bin/mongod(_ZN5mongo13FieldRangeSetC1EPKcRKNS_7BSONObjEbb+0x194) [0x55b554]
 /usr/local/bin/mongod(_ZN5mongo16MultiPlanScannerC1EPKcRKNS_7BSONObjES5_PKNS_11BSONElementEbS5_S5_bb+0x1f5) [0x8e07e5]
 /usr/local/bin/mongod(_ZN5mongo11MultiCursorC1EPKcRKNS_7BSONObjES5_N5boost10shared_ptrINS0_8CursorOpEEEbb+0x138) [0x8e13c8]
 /usr/local/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverEb+0x38e) [0x96026e]
 /usr/local/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEb+0x13d) [0x96501d]
 /usr/local/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x474) [0x88ce24]
 /usr/local/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x116f) [0x88e7cf]
 /usr/local/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x78) [0xaa0b38]
 /usr/local/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x638767]
 /lib64/libpthread.so.0 [0x3ac5e0677d]
 /lib64/libc.so.6(clone+0x6d) [0x3ac52d325d]
 
Logstream::get called in uninitialized state
Thu Sep 13 11:18:13 [conn79] ERROR: Client::~Client _context should be null but is not; client:conn
Logstream::get called in uninitialized state
Thu Sep 13 11:18:13 [conn79] ERROR: Client::shutdown not called: conn

Comment by Xavier Tesch [ 24/Sep/12 ]

I believe there is a big issue there.

Real world example:

-A database with 10GB data size, 14GB storage size and mongodb files taking 30GB on disk because of big deletions.
-Disk has 1.7GB of free space.

MongoDB behaviour:

-Impossible to save anything in DB due to write lock because it cannot preallocate next file.
-mongod keeps trying to prealloc unsuccessfully.

IMHO mongod should never be in that situation where it is not functional despite having a LOT of space in already allocated files. This is a very annoying bug that should be fixed as soon as possible.

The correct behaviour would be to skip preallocation when there is not enough free disk space, and not to lock writes while there is still enough space in the already existing files.
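
A minimal sketch of the check proposed here, written in Python for brevity: compare the free space on the data volume with the size of the next data file before attempting to preallocate. `can_preallocate` and its parameters are hypothetical names, not part of mongod. The 2146435072-byte file size is taken from the FileAllocator log line quoted in this ticket, and "1.7GB" is read as 1.7 GiB, which is an assumption.

```python
import shutil

NEXT_FILE_SIZE = 2146435072  # bytes; size reported in the FileAllocator log line


def can_preallocate(data_dir, file_size=NEXT_FILE_SIZE, free_bytes=None):
    """Return True if the volume holding data_dir has room for file_size bytes.

    free_bytes may be passed explicitly (useful for testing); otherwise the
    free space is read from the filesystem with shutil.disk_usage.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(data_dir).free
    return free_bytes >= file_size


# The situation described above: about 1.7 GiB free, a ~2 GB file wanted.
free = int(1.7 * 1024 ** 3)
print(can_preallocate(".", free_bytes=free))  # False
```

With this check the server could skip the allocation attempt and keep writing into the space already held by existing files, which is the behaviour this comment asks for.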

Comment by Mathias Stearn [ 08/Oct/11 ]

see: https://jira.mongodb.org/browse/SERVER-2609

Right now we prevent taking a write lock but still allow reads after a data-file prealloc fails. We then periodically retry to allocate the file to recover if the user makes more space available. A good enhancement would be to set the new maintenance mode to force a stepdown and make the node temporarily hidden.
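
The retry behaviour described above can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the FileAllocator code: `PreallocState`, `retry_until_allocated`, and the injected `allocate` and `sleep` callables are hypothetical, and the 10-second interval is taken from the "will try again in 10 seconds" log line quoted in this ticket.

```python
import time

RETRY_SECONDS = 10  # interval taken from the FileAllocator log line


class PreallocState:
    """Tracks whether writes are allowed; reads are never blocked here."""

    def __init__(self):
        self.writes_allowed = True


def retry_until_allocated(allocate, state, sleep=time.sleep, max_attempts=None):
    """Call allocate() until it succeeds, blocking writes while it fails.

    allocate is a caller-supplied function that raises OSError on failure.
    Returns the number of attempts made, or None if max_attempts ran out.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            allocate()
        except OSError:
            state.writes_allowed = False  # refuse write locks, keep serving reads
            sleep(RETRY_SECONDS)
            continue
        state.writes_allowed = True  # space was freed; resume writes
        return attempts
    return None
```

Passing `sleep` and `allocate` in as arguments is only a convenience for trying the loop out with a fake allocator; the point is that the write flag flips back as soon as one allocation succeeds.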

Comment by Dwight Merriman [ 08/Oct/11 ]

Currently, it is supposed to terminate on out of disk space. Of course that may not be ideal. If it seg faults, that is definitely a bug and should be fixed right away.

For the long term what is the right behavior:

  • could return errors on all writes (or allocating writes) forever, and reads work. wonder if this is risky if one does not notice the write errors instantly.
  • should this cause failover in a replica set?
  • idea: if a repl set there could be a low water mark where it terminates, and that low water mark be slightly different for each member, so they don't all fail at the same time if they are symmetric.
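
The staggered low water mark idea in the last bullet could look something like this sketch, in Python. All names and numbers are hypothetical and chosen only for illustration; nothing here reflects an existing mongod option.

```python
def low_water_marks(base_bytes, step_bytes, member_count):
    """Return one termination threshold (in free bytes) per replica set member.

    Member i gets base_bytes + i * step_bytes, so members with identical disks
    cross their thresholds at different times instead of all at once.
    """
    return [base_bytes + i * step_bytes for i in range(member_count)]


def should_terminate(free_bytes, mark_bytes):
    """True once free space has dropped to or below this member's mark."""
    return free_bytes <= mark_bytes


GIB = 1024 ** 3
marks = low_water_marks(base_bytes=1 * GIB, step_bytes=GIB // 4, member_count=3)
# With 1.3 GiB free on every member, only the member with the highest mark stops.
print([should_terminate(int(1.3 * GIB), m) for m in marks])  # [False, False, True]
```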