[SERVER-8006] mongod reports "Too many open files" after the first config server was killed and finally crashed Created: 21/Dec/12  Updated: 17/Jul/13  Resolved: 18/Jun/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: stronglee Assignee: Thomas Rueckstiess
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

os: Debian 6.0 kernel: Linux 2.6.38-2-amd64
cluster: a sharded cluster with 4 replica sets. Each replica set contains a primary, a secondary, and an arbiter (mongod v2.2.1)


Attachments: Text File log.txt     Text File serverStatus.txt    
Operating System: Linux
Participants:

 Description   

We tried to upgrade MongoDB components from 2.2.1 to 2.2.2 (simply by killing, upgrading, and restarting each component one by one).
Several minutes after we killed the first config server, one of the mongod processes began to report "Too many open files", and it crashed several minutes later.

The system's limit had already been set:

Limit           Soft Limit  Hard Limit  Units
Max open files  65535       65535       files

I guess that in this situation, mongod opens too many files/sockets and forgets to close them.
Please look into the attachment; it contains the full log from the time we killed the config server to the end.



 Comments   
Comment by stronglee [ 05/Mar/13 ]

Hi Christopher Clarke,

I think we didn't reboot the server at that time, because we usually don't reboot servers unless they are down.
But it has been a long time since the bug happened, so I am not 100 percent sure of this.
We haven't hit this bug again. Shall I suggest closing this issue as "Cannot Reproduce"?

Comment by xofer [ 05/Mar/13 ]

stronglee, I just want to make sure you know that the bug only affects limits when the process is started at boot. If you start mongod any other way, the limits are set correctly.

Comment by stronglee [ 24/Dec/12 ]

Hi Thomas Rueckstiess,

Thanks for your reply.
We do use start-stop-daemon to start mongod.
However, I am quite sure that our system isn't affected by the Debian bug you described.

1) We had set the limits correctly, just as the articles you recommended describe.

2) I used not only "ulimit -a" but also "cat /proc/<mongod pid>/limits" to check the limits, and both show that the
"open files" limit is 65535.

3) I wrote a small Python program to test the max open files limit.

#!/usr/bin/python

import os
os.chdir('/home/strlee/test/files')
l = []
for i in xrange(70000):
    l.append(open(str(i), 'w'))  # keep every file handle open

I ran it with start-stop-daemon: start-stop-daemon --start --exec /home/strlee/test/f.py
The result is:

Traceback (most recent call last):
  File "/home/strlee/test/f.py", line 8, in <module>
IOError: [Errno 24] Too many open files: '65532'

In the normal state, our mongod opens about 3000 files/sockets at most.

lsof | grep mongodb | wc -l
2728

So I think mongod had opened more than 65000 files/sockets when this bug occurred.

Thanks

Comment by Thomas Rueckstiess [ 24/Dec/12 ]

Hi,

Do you use start-stop-daemon to start the mongod process?

There is a bug in Debian which I believe hasn't been fixed yet (even though it was first reported in 2005). Basically, processes started as daemons ignore the settings from the limits file.

See also the following links that describe the problem and possible solutions:
http://www.jayway.com/2012/02/11/how-to-really-fix-the-too-many-open-files-problem-for-tomcat-in-ubuntu/
http://ubuntuforums.org/showthread.php?t=1583041
http://superuser.com/questions/454465/make-ulimits-work-with-start-stop-daemon

The last link recommends adding an explicit ulimit call to your init.d script before running mongod.
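A sketch of that workaround follows; the paths and options below are illustrative assumptions, not the stock Debian init script. The idea is to raise the limit in the shell that spawns the daemon, so that start-stop-daemon inherits it:

```shell
#!/bin/sh
# Hypothetical excerpt of an /etc/init.d/mongodb start stanza;
# exec path and config path are assumptions for illustration.
ulimit -n 65535   # raised here because start-stop-daemon bypasses pam_limits
start-stop-daemon --start --quiet \
    --exec /usr/bin/mongod -- --config /etc/mongodb.conf
```

The ulimit call must appear in the same shell invocation that runs start-stop-daemon, since child processes inherit resource limits from their parent.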

Could you please confirm whether you're affected by this Debian bug?

Thanks,
Thomas

Comment by stronglee [ 22/Dec/12 ]

Hi Eliot Horowitz, please see the serverStatus.txt attachment.

Comment by Eliot Horowitz (Inactive) [ 22/Dec/12 ]

Can you send db.serverStatus() then?

Comment by stronglee [ 22/Dec/12 ]

Hi Eliot Horowitz, I'm sorry, but we don't use MMS for some reasons. However, we deployed a monitoring system that works the same way as MMS.

Comment by Eliot Horowitz (Inactive) [ 21/Dec/12 ]

Is this node in MMS?

Generated at Thu Feb 08 03:16:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.