[SERVER-2599] potential connection leak in sharded environment Created: 21/Feb/11  Updated: 17/Mar/11  Resolved: 21/Feb/11

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Benedikt Waldvogel Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:
  1. lsb_release -a
    LSB Version: n/a
    Distributor ID: SUSE LINUX
    Description: openSUSE 11.2 (x86_64)
    Release: 11.2
    Codename: n/a

Attachments: PNG File mongodb_connection_leak.png     Text File mongos.log    
Operating System: Linux
Participants:

 Description   

I have a system setup that is very similar to the architecture described in http://www.snailinaturtleneck.com/blog/2010/08/30/return-of-the-mongo-mailbag/ (4 servers, 2 shards).
Worryingly, the number of open connections to the two masters appears to increase steadily over time (see the attached Ganglia graph).

Here are a few odd observations I made:
1) One mongos has fewer than 20 client connections but more than 30 connections to each mongod.
2) The number of open connections reported by netstat is much lower than the number reported by db.serverStatus(): 50 vs. 101.
3) db.currentOp() shows a few active "writebacklisten" operations that have been running for 3+ days (might be related to SERVER-2434).
4) The mongos log contains a number of client connection timeout errors like: "MessagingPort recv() errno:110 Connection timed out..."

The clients use version 2.4 of the Java driver, and the client instance is kept open and reused for every query.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 28/Feb/11 ]

The server sets TCP keep-alive, so it's not strictly required on the Java side; opened JAVA-287 to add it there as well.

I believe serverStatus and netstat disagree because the mongod process isn't notified of dropped connections in the same way the TCP/IP stack is.

Comment by Benedikt Waldvogel [ 28/Feb/11 ]

That's true, and I agree that this is bad behavior on the firewall's part.
Nevertheless, it would definitely help if the Java driver set keep-alive on the socket (which it currently does not!).
I'm also wondering why db.serverStatus() now reports 162 connections while netstat shows only 88. Is that a bug?

Comment by Eliot Horowitz (Inactive) [ 25/Feb/11 ]

Keep-alive is set.

The default kernel keep-alive settings only kick in after 2 hours.

In general, routers aren't supposed to kill connections that have been idle for less than 2 hours.

Comment by Benedikt Waldvogel [ 25/Feb/11 ]

It turns out that the frequent connection timeouts happen because the firewall silently drops connections that have been idle for at least one hour.
I'm surprised that the Java driver and mongos keep idle connections open for that long without even setting TCP keep-alive.
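For illustration, the missing per-socket flag is a one-line call on a plain java.net.Socket; this JDK-only sketch (class name is mine, nothing MongoDB-specific) shows that SO_KEEPALIVE is off by default and how it is enabled:

```java
import java.net.Socket;

public class KeepAliveSketch {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();

        // SO_KEEPALIVE is disabled on a fresh socket, so an idle
        // connection generates no traffic at all: a stateful firewall
        // that expires idle entries after an hour silently drops it.
        System.out.println("keep-alive by default: " + s.getKeepAlive());

        // One call enables kernel keep-alive probes on this socket.
        // Probe timing is governed by the kernel; on Linux,
        // net.ipv4.tcp_keepalive_time defaults to 7200 seconds,
        // i.e. the 2 hours mentioned in this thread.
        s.setKeepAlive(true);
        System.out.println("keep-alive enabled: " + s.getKeepAlive());

        s.close();
    }
}
```

Note that enabling the flag only makes probes possible; against a one-hour firewall timeout, the two-hour Linux default would still lose unless the tcp_keepalive_time sysctl is lowered.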

Comment by Benedikt Waldvogel [ 23/Feb/11 ]

I've attached the logfile.

Comment by Eliot Horowitz (Inactive) [ 22/Feb/11 ]

Can you attach the mongos log?

Comment by Benedikt Waldvogel [ 22/Feb/11 ]

https://gist.github.com/838638#file_mongod_netstat.txt
https://gist.github.com/838638#file_mongod_server_status.js
https://gist.github.com/838638#file_mongos_conn_pool_stats.js
https://gist.github.com/838638#file_mongod_conn_pool_stats.js

Comment by Eliot Horowitz (Inactive) [ 21/Feb/11 ]

Can you show serverStatus(), connPoolStats and netstat from the same point in time?

Comment by Benedikt Waldvogel [ 21/Feb/11 ]

How do you explain that db.serverStatus() reports over 100 connections while netstat shows only 50?

Comment by Eliot Horowitz (Inactive) [ 21/Feb/11 ]

mongos keeps connections open to all shards at all times.

To see them, you can run:

db.runCommand( "connPoolStats" )

The writeback connections are also supposed to be there, and they live forever.

Generated at Thu Feb 08 03:00:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.