[SERVER-7725] File allocation leads to full disk, e4fs allocation error Created: 20/Nov/12 Updated: 08/Mar/13 Resolved: 24/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Benjamin Abbott-Scott | Assignee: | David Hows |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
uname: Linux 421762-mongo3.enl.enphaseenergy.com 2.6.18-238.45.1.el5 #1 SMP Thu Sep 20 12:19:35 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux |
||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Unable to reproduce intentionally |
| Participants: |
| Description |
|
On November 3, mongod on mongo1, a hidden member of the replica set, became unresponsive. Examining the system logs, we found thousands of messages like:

Nov 3 17:20:47 276036-mongo1 kernel: EXT4-fs: Can't allocate: Allocation context details:

The only EXT4 filesystem on the server is the mongo partition, and mongod is the sole process using it. The only correlating event we could find in the mongod logs was:

Sat Nov 3 17:18:48 [FileAllocator] allocating new datafile /var/lib/mongo/enlighten_production/enlighten_production.93, filling with zeroes...

On other FileAllocator events, an ‘allocation complete’ message was logged as well. This one had no matching completion message. Eventually mongod failed:

Sat Nov 3 17:25:41 [journal] LogFile::synchronousAppend failed with 139264 bytes unwritten out of 139264 bytes; b=0x2adec2402000 errno:28 No space left on device
Sat Nov 3 17:25:41 Backtrace:

The server was rebooted, and mongod was able to recover itself as a secondary. Three days later, the primary server in the set, mongo3 (syslog still had the hostname as cache2), had the same issue:

Nov 6 10:44:46 421762-cache2 kernel: EXT4-fs: Can't allocate: Allocation context details:
Tue Nov 6 10:44:29 [FileAllocator] allocating new datafile /var/lib/mongo/enlighten_production/enlighten_production.94, filling with zeroes...

In both cases, there was more than 100GB free on the partition at the time. It looks as though the allocation, or the zeroing, went haywire and filled the whole partition. |
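The symptom described above (a FileAllocator start line with no matching completion line) can be checked mechanically across a mongod log. The following is a minimal sketch, not part of the original report. The start pattern follows the log excerpts quoted in this ticket; the completion pattern ("done allocating datafile") and the function name `unfinished_allocations` are assumptions for illustration and may need adjusting to the actual log text.

```python
import re

# Start pattern mirrors the excerpts quoted in this ticket.
START = re.compile(r"\[FileAllocator\] allocating new datafile (\S+?),")
# ASSUMPTION: the completion line is worded "done allocating datafile <path>, ...".
DONE = re.compile(r"\[FileAllocator\] done allocating datafile (\S+?),")


def unfinished_allocations(lines):
    """Map each datafile path to its start line if no completion line followed."""
    pending = {}
    for line in lines:
        started = START.search(line)
        if started:
            # Remember the start line until a completion for the same path appears.
            pending[started.group(1)] = line.rstrip("\n")
            continue
        finished = DONE.search(line)
        if finished:
            # Completion seen: this allocation is no longer outstanding.
            pending.pop(finished.group(1), None)
    return pending
```

Passing an open log file (for example `unfinished_allocations(open(path))`, where `path` is wherever the mongod log lives) yields the allocations that never completed, which is the condition reported for the .93 and .94 datafiles.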
| Comments |
| Comment by David Hows [ 27/Nov/12 ] | ||||||||
|
Hi Benjamin, From your dmesg output I can see a couple of things worth noting, but first: has the system been rebooted since the issue occurred, and has the issue recurred since? The devices dm-3 and dm-4 are both running ext4, appear to be running without a journal, and are running unchecked. Additionally, dm-3 has a mount option that it cannot parse and appears to have failed its barrier check.
Can you run the following command to identify which devices map to which volume groups?
Can you please attach your /etc/fstab file? I would like to see which mount options are being passed to which filesystems. Finally, can you give some background as to why you are running your filesystems in barrier mode and writeback mode? Cheers, David | ||||||||
| Comment by Benjamin Abbott-Scott [ 27/Nov/12 ] | ||||||||
|
mongod log minus connection actions (on the order of 4200/min) | ||||||||
| Comment by Benjamin Abbott-Scott [ 27/Nov/12 ] | ||||||||
|
Attaching dmesg output. SMART not available as the drives are presented by the PERC controller, but omsa reports no predicted failures.
bascott@421762-mongo3 20:10 $ df -h
sudo /usr/sbin/vgdisplay -v vg2
--- Logical volume ---
--- Physical volumes --- | ||||||||
| Comment by David Hows [ 22/Nov/12 ] | ||||||||
|
Hi Benjamin, This particular error comes from the Linux kernel itself and indicates some form of error working with the ext4 device. Can you please attach the log files for the mongod instance in question, as I would like to see what it was doing beforehand? Can you also provide:
It may also be worthwhile checking whether Red Hat has any information on this issue. Cheers, David |