Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Incomplete
Priority: Critical - P2
Fix Version/s: None
Affects Version/s: 2.0.6
Component/s: Stability
Labels:
None
Environment:

Hide
uname: Linux 421762-mongo3.enl.enphaseenergy.com 2.6.18-238.45.1.el5 #1 SMP Thu Sep 20 12:19:35 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
e4fsprogs version 1.41.12-2.el5
mongo version mongo-10gen-server-2.0.6-mongodb_1.x86_64
Hardware: Dell PowerEdge 2970, PowerEdge R710

Show
uname: Linux 421762-mongo3.enl.enphaseenergy.com 2.6.18-238.45.1.el5 #1 SMP Thu Sep 20 12:19:35 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux e4fsprogs version 1.41.12-2.el5 mongo version mongo-10gen-server-2.0.6-mongodb_1.x86_64 Hardware: Dell PowerEdge 2970, PowerEdge R710

Operating System:
ALL
Steps To Reproduce:

Hide

Unable to reproduce intentionally

Show
Unable to reproduce intentionally
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

On November 3, mongo1, a hidden member of the replica set had mongod go unresponsive. Examining system logs, we found thousands of messages like:

Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: Can't allocate: Allocation context details:
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: status 1 flags 3104
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: orig 1598/525/1@477864, goal 1598/515/2048@477184, best 0/0/0@477864 cr 3
Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: 0 scanned, 0 found

The only EXT4 filesystem on the server is the mongo partition, and mongod is the sole process using it. The only correllating event we could find in mongo logs was:

Sat Nov 3 17:18:48 [FileAllocator] allocating new datafile /var/lib/mongo/enlighten_production/enlighten_production.93, filling with zeroes...

On other FileAllocator events, an ‘allocation complete’ message was logged as well. This one had no matching completion message. Eventually mongod failed:

Sat Nov 3 17:25:41 [journal] LogFile::synchronousAppend failed with 139264 bytes unwritten out of 139264 bytes; b=0x2adec2402000 errno:28 No space left on device
Sat Nov 3 17:25:41 Got signal: 6 (Aborted).

Sat Nov 3 17:25:41 Backtrace:
0xa95ce9 0x3f322302d0 0x3f32230265 0x3f32231d10 0x75c13d 0x777da5 0x777fde 0x762f1c 0x76376d 0x763a8d 0x76436b 0xaabca0 0x3f3360673d 0x3f322d44bd
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x3a9) [0xa95ce9]
/lib64/libc.so.6 [0x3f322302d0]
/lib64/libc.so.6(gsignal+0x35) [0x3f32230265]
/lib64/libc.so.6(abort+0x110) [0x3f32231d10]
/usr/bin/mongod(_ZN5mongo7LogFile17synchronousAppendEPKvm+0x12d) [0x75c13d]
/usr/bin/mongod(_ZN5mongo3dur7Journal7journalERKNS0_11JSectHeaderERKNS_14AlignedBuilderE+0x1e5) [0x777da5]
/usr/bin/mongod(_ZN5mongo3dur14WRITETOJOURNALENS0_11JSectHeaderERNS_14AlignedBuilderE+0x4e) [0x777fde]
/usr/bin/mongod(_ZN5mongo3dur28_groupCommitWithLimitedLocksEv+0x24c) [0x762f1c]
/usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0x1d) [0x76376d]
/usr/bin/mongod [0x763a8d]
/usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x10b) [0x76436b]
/usr/bin/mongod(thread_proxy+0x80) [0xaabca0]
/lib64/libpthread.so.0 [0x3f3360673d]
/lib64/libc.so.6(clone+0x6d) [0x3f322d44bd]
Sat Nov 3 17:25:41 Invalid access at address: 0
Sat Nov 3 17:25:41 Got signal: 11 (Segmentation fault).

The server was rebooted, and mongo was able to recover itself as secondary. Three days later, the primary server in the set, mongo3 (syslog still had the hostname as cache2), had the same issue:

Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: Can't allocate: Allocation context details:
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: status 1 flags 3104
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: orig 1597/1831/64@300060, goal 1597/1830/2048@299008, best 0/0/0@300060 cr 3
Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: 0 scanned, 0 found

Tue Nov 6 10:44:29 [FileAllocator] allocating new datafile /var/lib/mongo/enlighten_production/enlighten_production.94, filling with zeroes...
...
Tue Nov 6 10:46:11 [journal] LogFile::synchronousAppend failed with 114688 bytes unwritten out of 114688 bytes; b=0x2adccb794000 errno:28 No space left on device
Tue Nov 6 10:46:11 Got signal: 6 (Aborted).

In both cases, there was more than 100GB free on the partition at the time. It looks as though the allocation, or the zeroing, went haywire and filled the whole partition.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

dmesg.txt
40 kB
Nov 27 2012 04:18:42 AM UTC
dmesg.txt
40 kB
Nov 27 2012 04:18:18 AM UTC
mongod.log
590 kB
Nov 27 2012 04:24:34 AM UTC

Assignee:: David Hows (Inactive)
Reporter:: Benjamin Abbott-Scott
Participants:: Benjamin Abbott-Scott, David Hows
Votes:: 0 Vote for this issue
Watchers:: 4 Start watching this issue

Created:: Nov 20 2012 06:27:36 PM UTC
Updated:: Mar 08 2013 03:56:11 PM UTC
Resolved:: Feb 24 2013 11:16:34 PM UTC

Details

Description

Attachments

Attachments

Activity

People

Dates