[SERVER-1633] too many open files, lsof "can't identify protocol" Created: 18/Aug/10  Updated: 29/May/12  Resolved: 06/Sep/10

Status: Closed
Project: Core Server
Component/s: Sharding, Stability
Affects Version/s: 1.6.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Jonas Assignee: Alberto Lerner
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

We use mongodb with sharding and the java-connector.
80 appservers (tomcat); on every appserver 1 mongos daemon, connecting to 2 mongod servers.


Attachments: File mongod-2010-09-01.log.gz     File mongod.conf     File mongod.log.2010-08-25.gz     File mongos_debug.log.gz     Text File netstat_grep3307.txt     Text File netstat_grep3309.txt    
Operating System: Linux
Participants:

 Description   

On all of our 80 appservers, 1 mongos daemon is installed. Very frequently, the mongos daemon stops working because of "too many open files", which also spams the mongos logfiles. We increased the open-files limit to 30000 and this fixed the problem for a short time. But soon the problem was back, and I discovered the real source. Somehow the mongos daemon seems to create some kind of idling zombie connections. When I do a

lsof | grep 'mongos'

This is the output:
......thousands of lines .....
mongos 22843 root 8523u sock 0,5 2863975141 can't identify protocol
mongos 22843 root 8524u sock 0,5 2863975143 can't identify protocol
mongos 22843 root 8525u sock 0,5 2863975162 can't identify protocol
mongos 22843 root 8526u sock 0,5 2863975219 can't identify protocol
......thousands of lines ......

These zombie connections fill up all the available file descriptors.

To me this looks like a bug in mongos, or in the mongodb-java driver (or both).
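For reference, a minimal sketch of how the descriptor usage and limit of the running mongos process can be checked (assuming a single mongos per appserver and standard Linux /proc tooling; pgrep is only used here to pick up its pid):

# current open-files limit of the running mongos
cat /proc/$(pgrep mongos)/limits | grep 'open files'

# number of file descriptors the mongos currently holds
ls /proc/$(pgrep mongos)/fd | wc -l

# how many of those are the unidentifiable leaked sockets
lsof -p $(pgrep mongos) | grep -c "can't identify protocol"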



 Comments   
Comment by Alberto Lerner [ 01/Sep/10 ]

Solving this problem – which smells like a single bug – is of the highest priority to us. Your help has been instrumental in isolating it. We appreciate you sending this log and apologize that, despite our best efforts, we haven't gotten to the bottom of it yet.

The log you sent has one more trail I'd like to drill down on, so that's helpful. We have also added instrumentation code to the 1.7 branch.

I'll keep you posted.

Comment by Jonas [ 01/Sep/10 ]

Hi, I'm sorry to tell you that for now we have suspended our tests with mongodb; for our needs the sharding functionality was too unstable, we couldn't find the reason, and we can't invest any more time. With today's deployment we deactivated mongodb. However, yesterday I deployed the nightly version, and I attached the gzipped logfile of that one day. Perhaps it can be of use.

Comment by Eliot Horowitz (Inactive) [ 28/Aug/10 ]

The nightly is up - can you give it another shot?

Thanks for all your patience - really want to figure this out.

Comment by auto [ 26/Aug/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1633 turn logging up to uncover bottleneck
http://github.com/mongodb/mongo/commit/9aac39189e67a0abcb84a2c311c981f435c3ffbd

Comment by Alberto Lerner [ 26/Aug/10 ]

We got a good lead using the last log. But we still need to dig deeper to find the source of the problem.

Two questions:

Might anything out of the ordinary have happened on Aug 25 from around 8:15 pm onward? We see exponentially growing response times for a particular internal command (setShardVersion) starting then.

What platform are you running on? I'll put in some changes that will help us peel back more layers using the log. If you are okay with it, they'd get into today's nightly pipeline.

Comment by Jonas [ 26/Aug/10 ]

The network between the database and the mongos is fine. I attached the anonymized log of the failing shard server. It runs for 1 or 2 days, and then it starts failing (too many connections ...).

Comment by Alberto Lerner [ 23/Aug/10 ]

Jonas, the log indicates a lot of connectivity issues. The high number of "created" in connPoolStats reflects it: the mongos is reconnecting after getting broken pipes. We were wondering why those are happening. Is the networking bad between the mongos and XX.XX.XX.87:3307? Or is that shard crashing?

We still can't reproduce this here.
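(One quick way to rule out shard restarts, assuming shell access to the shard at the host/port above: check the uptime reported by serverStatus on that mongod.)

mongo XX.XX.XX.87:3307/admin --eval 'printjson(db.runCommand("serverStatus").uptime)'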

Comment by Jonas [ 23/Aug/10 ]

On sv287 there are again these huge connection leaks. I attached the mongos.log.

MongoDB shell version: 1.6.0
connecting to: 127.0.0.1:3309/test
> db.runCommand( "connPoolStats" )
{
    "hosts" : {
        "XX.XX.XX.86:3307" : { "available" : 1, "created" : 357 },
        "XX.XX.XX.87:3307" : { "available" : 0, "created" : 19516 },
        "XX.XX.XX.87:3308" : { "available" : 1, "created" : 1593 }
    },
    "ok" : 1
}

Comment by Eliot Horowitz (Inactive) [ 19/Aug/10 ]

Ok - it looks like we're leaking connections to BBBBB.
Can you attach a full mongos log?

Comment by Jonas [ 19/Aug/10 ]

I ran the command on one appserver. The very high number of created connections to shard2 (BBBB) is almost the same as the number of idling "zombie" connections. (It's fairly low (about 2000) now because of a recent restart; I will check again in a few hours and post the result.) I also attached our mongodb config for shard2.

The shards are listening on port 3307.
AAAA = IP of the first mongodb shard
BBBB = IP of the second mongodb shard
CCCC = the mongodb config server

> db.runCommand( "connPoolStats" )
{
    "hosts" : {
        "AAAAA:3307" : { "available" : 1, "created" : 36 },
        "BBBBB:3307" : { "available" : 0, "created" : 1915 },
        "CCCCC:3308" : { "available" : 0, "created" : 75 }
    },
    "ok" : 1
}

Comment by Eliot Horowitz (Inactive) [ 19/Aug/10 ]

Can you connect to the mongos and run

db.runCommand( "connPoolStats" )
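(This can also be run non-interactively from the appserver; the port here assumes the mongos listening on 3309 as described above.)

mongo --port 3309 --eval 'printjson(db.runCommand("connPoolStats"))'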

Comment by Jonas [ 19/Aug/10 ]

These are the commands used to generate the files (I attached the files to this ticket):

XXX.XXX.XXX.XXX is the IP of the appserver
YYY.YYY.YYY.YYY is the IP of the mongodb-server
netstat -tn | grep 3307 > netstat_grep3307.txt

The mongos is listening on port 3309, so these are the connections from the appserver to the mongos:
netstat -tn | grep 3309 > netstat_grep3309.txt

A line count on the files:
1183 netstat_grep3307.txt
370 netstat_grep3309.txt

lsof | grep mongos | grep -c "can't identify protocol"
11561

The developers said that the java-mongodb driver uses connection pooling and should open at most about 75 connections.
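(One way to check that figure against reality, assuming the mongos listening on port 3309 as above, is to count the established app -> mongos connections on the appserver:)

netstat -tn | grep 3309 | grep -c ESTABLISHED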

Comment by Eliot Horowitz (Inactive) [ 18/Aug/10 ]

I'm looking for the socket states and what they are connected to.

I.e. mongos -> mongod, or app -> mongos.
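(One way to see both directions for just the mongos process, assuming lsof is available on the appserver, is to list its network sockets with peer addresses and states:)

lsof -nP -a -i -p $(pgrep mongos)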

Comment by Jonas [ 18/Aug/10 ]

This is our mongos command:
/opt/mongo/bin/mongos --port 3309 --quiet --configdb <IPOFCONFIGDB>:3308

What is your theory, or what exactly are you looking for? Perhaps with that I can gather more useful information.

Comment by Eliot Horowitz (Inactive) [ 18/Aug/10 ]

Are you using normal mongo ports?

Comment by Jonas [ 18/Aug/10 ]

Here is the netstat output:

"netstat -an | grep 270"
tcp 1 0 127.0.0.1:3309 127.0.0.1:42709 CLOSE_WAIT
tcp 1 0 127.0.0.1:3309 127.0.0.1:52706 CLOSE_WAIT

And here is the number of currently open "zombie connections":
lsof | grep "mongos" | grep -c "can't identify protocol"
15152

Some additional information about the other TCP connection states:

netstat -nt | awk '{print $6}' | sort | uniq -c | sort -n -k 1 -r
1832 ESTABLISHED
633 FIN_WAIT2
504 TIME_WAIT
466 CLOSE_WAIT
11 FIN_WAIT1
2 SYN_SENT
1 Foreign
1

Comment by Eliot Horowitz (Inactive) [ 18/Aug/10 ]

Can you do "netstat -an | grep 270" on the appservers and attach that?
