[SERVER-1633] too many open files, lsof "can't identify protocol" Created: 18/Aug/10 Updated: 29/May/12 Resolved: 06/Sep/10 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding, Stability |
| Affects Version/s: | 1.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Jonas | Assignee: | Alberto Lerner |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | we use mongodb with sharding and the java-connector |
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
On each of our 80 appservers, one mongos daemon is installed. Very frequently, the mongos daemon stops working because of "too many open files", which also spams the mongos logfiles. We increased the open-files limit to 30000 and this fixed the problem for a short time, but soon the problem was back and I discovered the real source: somehow the mongos daemons seem to create some kind of idling zombie connections, which show up when I do lsof | grep 'mongos'. These zombie connections fill up all the available file descriptors. To me it looks like a bug in mongos, or the mongodb-java connector (or both). |
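A minimal way to quantify these leaked descriptors on an affected appserver (a sketch based on the lsof symptom in the title; pidof and /proc/<pid>/limits are standard Linux tools, not commands taken from this ticket):

    # Count sockets held by mongos that lsof cannot associate with any protocol
    # (the "zombie" descriptors described in this report).
    lsof -n | grep mongos | grep -c "can't identify protocol"

    # Compare that count against the daemon's file-descriptor limit.
    grep "open files" /proc/$(pidof mongos)/limits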
| Comments |
| Comment by Alberto Lerner [ 01/Sep/10 ] |
|
Solving this problem, which smells like a single bug, is of the highest priority to us. Your help has been instrumental in isolating it. We appreciate you sending this log and apologize that, despite our best efforts, we haven't gotten to the bottom of it yet. The log you sent has one more trail I'd like to drill down into, so it is definitely helpful. We have also added instrumentation code to the 1.7 branch. I'll keep you posted. |
| Comment by Jonas [ 01/Sep/10 ] |
|
Hi, I'm sorry to tell you that for now we have suspended our tests with mongodb: for our needs the sharding functionality was too unstable, we couldn't find the reason, and we can't invest any more time. With today's deployment we deactivated mongodb. However, yesterday I deployed the nightly version and I have attached the gzipped logfile of that one day. Perhaps it can be of use. |
| Comment by Eliot Horowitz (Inactive) [ 28/Aug/10 ] |
|
The nightly is up, so can you give it another shot? Thanks for all your patience; we really want to figure this out. |
| Comment by auto [ 26/Aug/10 ] |
|
Author: {'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'} Message: |
| Comment by Alberto Lerner [ 26/Aug/10 ] |
|
We got a good lead using the last log, but we still need to dig deeper to find the source of the problem. Two questions: Might anything out of the ordinary have happened on Aug 25 from around 8:15 pm onward? We saw exponentially growing response times from a particular internal command (setShardVersion) starting then. And what platform are you running on? I'll put in some changes that will help us peel back more layers using the log. If you are okay with it, they'd go into today's nightly build. |
| Comment by Jonas [ 26/Aug/10 ] |
|
The network between the database and the mongos is fine. I attached the anonymized log of the failing shard server. It runs for 1 or 2 days, and then it starts failing (too many connections ...). |
| Comment by Alberto Lerner [ 23/Aug/10 ] |
|
Jonas, the log indicates a lot of connectivity issues. The high number of "created" connections in connPoolStats reflects this; the mongos keeps reconnecting after getting broken pipes. We are wondering why those are happening. Is the networking bad between the mongos and XX.XX.XX.87:3307? Or is that shard crashing? We still can't reproduce this here. |
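A quick way to test basic reachability of that shard from the mongos host would be something like the following (a sketch; the address is the anonymized one from this comment, and ping is a standard MongoDB server command, not a step anyone in this ticket actually ran):

    # Issue a lightweight ping against the shard the mongos keeps reconnecting to.
    mongo XX.XX.XX.87:3307/admin --eval 'printjson(db.runCommand({ ping: 1 }))'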
| Comment by Jonas [ 23/Aug/10 ] |
|
On sv287 these huge connection leaks are happening again. I attached the mongos.log. MongoDB shell version: 1.6.0 |
| Comment by Eliot Horowitz (Inactive) [ 19/Aug/10 ] |
|
Ok, it looks like we're leaking connections to BBBBB |
| Comment by Jonas [ 19/Aug/10 ] |
|
I ran the command on one appserver. The very high number of created connections to shard2 (BBBB) is almost the same as the number of idling "zombie" connections. (It's fairly low now, about 2000, because of a recent restart; I will check again in a few hours and post the result.) I also posted our mongodb config for shard2. The shards are listening on port 3307. > db.runCommand( "connPoolStats" ) |
| Comment by Eliot Horowitz (Inactive) [ 19/Aug/10 ] |
|
Can you connect to the mongos and run db.runCommand( "connPoolStats" )? |
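One way to capture that output non-interactively so it can be attached to the ticket (a sketch; port 3309 is the mongos port mentioned elsewhere in this ticket, and printjson is the standard mongo shell helper):

    # Dump the mongos connection-pool statistics to a file; the per-host
    # "created" counters are what gets compared against the leaked sockets.
    mongo --port 3309 --eval 'printjson(db.runCommand("connPoolStats"))' > connpoolstats.txt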
| Comment by Jonas [ 19/Aug/10 ] |
|
These are the commands used to generate the files (I attached the files to this ticket). XXX.XXX.XXX.XXX is the IP of the appserver. The mongos is listening on port 3309, so these are the connections from the appserver to the mongos. A line count on the files: lsof | grep mongos | grep -c "can't identify protocol" The developers said that the java-mongodb-driver uses connection pooling and should only open around 75 connections maximum. |
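To separate the two connection directions raised in the next comment, the sockets can be split by port (a sketch using the ports stated in this ticket: the mongos listens on 3309, the shards on 3307; the output file names are made up for illustration):

    # Sockets between the appserver and its local mongos (mongos listens on 3309)
    netstat -an | grep ":3309" > app_to_mongos.txt

    # Sockets from the mongos out to the shard mongod processes (shards listen on 3307)
    netstat -an | grep ":3307" > mongos_to_shards.txt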
| Comment by Eliot Horowitz (Inactive) [ 18/Aug/10 ] |
|
I'm looking for the socket states and what they are connected to, i.e. mongos -> mongod, or app -> mongos. |
| Comment by Jonas [ 18/Aug/10 ] |
|
This is our mongos command: What is your theory, or what exactly are you looking for? Perhaps with that I can gather more useful information. |
| Comment by Eliot Horowitz (Inactive) [ 18/Aug/10 ] |
|
Are you using normal mongo ports? |
| Comment by Jonas [ 18/Aug/10 ] |
|
Here is the output of "netstat -an | grep 270", and here is the number of currently open "zombie" connections. Some additional information about other TCP connection states, counted with a pipeline ending in: ' | sort | uniq -c | sort -n -k 1 -r |
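The counting pipeline above is truncated in this export; a plausible full form (an assumption, not necessarily the exact command that was used) is the common per-state tally:

    # Count TCP sockets by state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...),
    # most frequent state first.
    netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n -k 1 -r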
| Comment by Eliot Horowitz (Inactive) [ 18/Aug/10 ] |
|
Can you do "netstat -an | grep 270" on the appservers and attach that? |