Details
-
Bug
-
Resolution: Incomplete
-
Critical - P2
-
None
-
2.0.6
-
None
-
ALL
-
Description
On November 3, mongo1, a hidden member of the replica set had mongod go unresponsive. Examining system logs, we found thousands of messages like:
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: Can't allocate: Allocation context details:
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: status 1 flags 3104
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: orig 1598/525/1@477864, goal 1598/515/2048@477184, best 0/0/0@477864 cr 3
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: 0 scanned, 0 found
The only EXT4 filesystem on the server is the mongo partition, and mongod is the sole process using it. The only correllating event we could find in mongo logs was:
Sat Nov 3 17:18:48 [FileAllocator] allocating new datafile /var/lib/mongo/enlighten_production/enlighten_production.93, filling with zeroes...
On other FileAllocator events, an ‘allocation complete’ message was logged as well. This one had no matching completion message. Eventually mongod failed:
Sat Nov 3 17:25:41 [journal] LogFile::synchronousAppend failed with 139264 bytes unwritten out of 139264 bytes; b=0x2adec2402000 errno:28 No space left on device
Sat Nov 3 17:25:41 Got signal: 6 (Aborted).
Sat Nov 3 17:25:41 Backtrace:
0xa95ce9 0x3f322302d0 0x3f32230265 0x3f32231d10 0x75c13d 0x777da5 0x777fde 0x762f1c 0x76376d 0x763a8d 0x76436b 0xaabca0 0x3f3360673d 0x3f322d44bd
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x3a9) [0xa95ce9]
/lib64/libc.so.6 [0x3f322302d0]
/lib64/libc.so.6(gsignal+0x35) [0x3f32230265]
/lib64/libc.so.6(abort+0x110) [0x3f32231d10]
/usr/bin/mongod(_ZN5mongo7LogFile17synchronousAppendEPKvm+0x12d) [0x75c13d]
/usr/bin/mongod(_ZN5mongo3dur7Journal7journalERKNS0_11JSectHeaderERKNS_14AlignedBuilderE+0x1e5) [0x777da5]
/usr/bin/mongod(_ZN5mongo3dur14WRITETOJOURNALENS0_11JSectHeaderERNS_14AlignedBuilderE+0x4e) [0x777fde]
/usr/bin/mongod(_ZN5mongo3dur28_groupCommitWithLimitedLocksEv+0x24c) [0x762f1c]
/usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0x1d) [0x76376d]
/usr/bin/mongod [0x763a8d]
/usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x10b) [0x76436b]
/usr/bin/mongod(thread_proxy+0x80) [0xaabca0]
/lib64/libpthread.so.0 [0x3f3360673d]
/lib64/libc.so.6(clone+0x6d) [0x3f322d44bd]
Sat Nov 3 17:25:41 Invalid access at address: 0
Sat Nov 3 17:25:41 Got signal: 11 (Segmentation fault).
The server was rebooted, and mongo was able to recover itself as secondary. Three days later, the primary server in the set, mongo3 (syslog still had the hostname as cache2), had the same issue:
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: Can't allocate: Allocation context details:
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: status 1 flags 3104
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: orig 1597/1831/64@300060, goal 1597/1830/2048@299008, best 0/0/0@300060 cr 3
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: 0 scanned, 0 found
Tue Nov 6 10:44:29 [FileAllocator] allocating new datafile /var/lib/mongo/enlighten_production/enlighten_production.94, filling with zeroes...
...
Tue Nov 6 10:46:11 [journal] LogFile::synchronousAppend failed with 114688 bytes unwritten out of 114688 bytes; b=0x2adccb794000 errno:28 No space left on device
Tue Nov 6 10:46:11 Got signal: 6 (Aborted).
In both cases, there was more than 100GB free on the partition at the time. It looks as though the allocation, or the zeroing, went haywire and filled the whole partition.