[SERVER-28974] Mongos leak connections to mongods Created: 26/Apr/17  Updated: 27/Oct/23  Resolved: 09/Jun/17

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.2.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tuomas Silen Assignee: Kelsey Schubert
Resolution: Gone away Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-28933 Mongod terminates when out of file de... Closed
Operating System: ALL
Steps To Reproduce:

The exact trigger is unknown, but it has happened multiple times now during high-traffic peaks that caused congestion in the internal network.

 Description   

We ran into an issue where mongos suddenly created a large number of connections to the mongods and ran out of file descriptors. This eventually consumed all descriptors on the mongods as well, which then crashed (https://jira.mongodb.org/browse/SERVER-28933).

This happened when there was some intermittent network congestion.

We saw errors like this when the leaks occurred:

2017-04-24T21:48:41.865+0000 I NETWORK  [ReplicaSetMonitorWatcher] Socket recv() timeout  10.123.0.1:27017
2017-04-24T21:48:41.865+0000 I NETWORK  [ReplicaSetMonitorWatcher] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.123.0.1:27017]
2017-04-24T21:48:41.865+0000 I NETWORK  [ReplicaSetMonitorWatcher] Detected bad connection created at 1493026624910579 microSec, clearing pool for mongo5:27017 of 0 connections

"0 connections" there is particularly interesting.

Eventually the leak led to running out of descriptors:

2017-04-24T21:48:42.487+0000 I NETWORK  [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
2017-04-24T21:48:42.487+0000 E NETWORK  [mongosMain] Out of file descriptors. Waiting one second before trying to accept more connections.
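
As a rough sketch of how this can be watched while it is happening (the log path is an assumption, and the pgrep lookup assumes a single mongos process on the host):

grep -c 'clearing pool for' /var/log/mongodb/mongos.log       # how many times pools were cleared
ls /proc/"$(pgrep -x mongos)"/fd | wc -l                      # descriptors currently open by mongos
grep 'Max open files' /proc/"$(pgrep -x mongos)"/limits       # the limit mongos is running against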

The logs indicated about 500 open client connections, but inspecting with lsof revealed fewer than 10. lsof also showed 6 connections to the mongocs, 23 connections to the primary, and a total of 474 connections to the two secondaries, plus 478 "sock" descriptors with "can't identify protocol":
mongos 5495 mongodb 1023u sock 0,7 0t0 84718243 can't identify protocol

Normally there are none of those, and only a few connections to the secondaries.
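
For reference, the lsof counts above can be derived roughly like this (5495 is just the mongos PID from the example line; substitute the real one):

lsof -nP -p 5495 | grep -c "can't identify protocol"                                                  # the orphaned "sock" descriptors
lsof -nP -a -i TCP -p 5495 | awk 'NR>1 {sub(/.*->/, "", $9); print $9}' | sort | uniq -c | sort -rn   # connections per remote endpoint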

This is currently a somewhat special setup with a single shard (a replica set of 3 nodes) and about 200 mongoses connecting to it. The leak occurred on all of the mongos instances (all were also affected by the network congestion).

This is what connPoolStats showed around that time (although I've understood that it's not entirely reliable):

MongoDB shell version: 3.2.12
connecting to: test
mongos> db.runCommand( { "connPoolStats" : 1 } )
{
	"numClientConnections" : 6,
	"numAScopedConnections" : 0,
	"totalInUse" : 478,
	"totalAvailable" : 24,
	"totalCreated" : 5000,
	"hosts" : {
		"mongocdata3:27019" : {
			"inUse" : 0,
			"available" : 1,
			"created" : 1
		},
		"mongocdata2:27019" : {
			"inUse" : 0,
			"available" : 1,
			"created" : 1
		},
		"mongocdata1:27019" : {
			"inUse" : 0,
			"available" : 1,
			"created" : 1
		},
		"mongo7:27017" : {
			"inUse" : 0,
			"available" : 1,
			"created" : 7
		},
		"mongo6:27017" : {
			"inUse" : 0,
			"available" : 1,
			"created" : 4
		},
		"mongo5:27017" : {
			"inUse" : 0,
			"available" : 1,
			"created" : 18
		}
	},
	"replicaSets" : {
		"datacfg" : {
			"hosts" : [
				{
					"addr" : "mongocdata1:27019",
					"ok" : true,
					"ismaster" : false,
					"hidden" : false,
					"secondary" : true,
					"pingTimeMillis" : 0
				},
				{
					"addr" : "mongocdata2:27019",
					"ok" : true,
					"ismaster" : true,
					"hidden" : false,
					"secondary" : false,
					"pingTimeMillis" : 0
				},
				{
					"addr" : "mongocdata3:27019",
					"ok" : true,
					"ismaster" : false,
					"hidden" : false,
					"secondary" : true,
					"pingTimeMillis" : 0
				}
			]
		},
		"rs1" : {
			"hosts" : [
				{
					"addr" : "mongo5:27017",
					"ok" : true,
					"ismaster" : true,
					"hidden" : false,
					"secondary" : false,
					"pingTimeMillis" : 0
				},
				{
					"addr" : "mongo6:27017",
					"ok" : true,
					"ismaster" : false,
					"hidden" : false,
					"secondary" : true,
					"pingTimeMillis" : 9
				},
				{
					"addr" : "mongo7:27017",
					"ok" : true,
					"ismaster" : false,
					"hidden" : false,
					"secondary" : true,
					"pingTimeMillis" : 4
				}
			]
		}
	},
	"ok" : 1
}
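
One thing that stands out above is that the per-host counters do not add up to the totals: the hosts report 0 inUse and only 32 created connections in total, while totalInUse is 478 (matching the 478 unidentified sockets in lsof) and totalCreated is 5000. A small sketch for making that comparison directly against the mongos, in the same spirit as the shell commands elsewhere in this ticket:

mongo --quiet --eval '
  var s = db.runCommand({ connPoolStats: 1 });
  var sum = { inUse: 0, available: 0, created: 0 };
  Object.keys(s.hosts).forEach(function (h) {
    sum.inUse += s.hosts[h].inUse;
    sum.available += s.hosts[h].available;
    sum.created += s.hosts[h].created;
  });
  print("per-host sums:   " + JSON.stringify(sum));
  print("reported totals: " + JSON.stringify({ inUse: s.totalInUse, available: s.totalAvailable, created: s.totalCreated }));
'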



 Comments   
Comment by Kelsey Schubert [ 09/Jun/17 ]

Hi devastor,

Thank you for the update. Since this is no longer an issue for you, I'm resolving this ticket. Please let us know if you encounter similar behavior in the future, and we will continue to investigate.

Kind regards,
Thomas

Comment by Tuomas Silen [ 30/May/17 ]

Hi Thomas,

Unfortunately, we have not seen it since upgrading. Also, our setup has changed a bit so we no longer have that many mongoses, which might affect it too.

Comment by Kelsey Schubert [ 24/May/17 ]

Hi devastor,

Have you encountered this issue again since upgrading?

Thanks,
Thomas

Comment by Tuomas Silen [ 03/May/17 ]

Thanks Thomas,

Just to update, we are now running 3.2.13 and periodically logging the output of connPoolStats, so let's see what happens.

Comment by Kelsey Schubert [ 27/Apr/17 ]

Hi devastor,

Thank you for reporting this behavior. We've completed a large amount of work in MongoDB 3.2.13 to improve the diagnostics available for investigating issues such as this. The release candidate, MongoDB 3.2.13-rc0, is available now, and the general availability release is scheduled for early next week. In particular, MongoDB 3.2.13 includes:

Would you please upgrade and periodically collect the output of connPoolStats? Feel free to increase the delay (in seconds) to a larger value that still captures the issue emerging, based on the timelines you've observed previously.

delay=1
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus().localTime)); print(JSON.stringify(db.runCommand({ \"connPoolStats\" : 1 }))); sleep(1000*${delay:?})}" >connPoolStats.log &

If you encounter this issue again, would you please upload the following diagnostic information:

  1. The connPoolStats.log file
  2. Complete log files of an affected mongos
  3. Complete log files of an affected mongod
  4. An archive of the diagnostic.data directory of the affected mongod (a sketch of how to collect this follows the list)
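
For item 4, a minimal sketch of collecting that archive (assuming the default dbPath of /var/lib/mongodb; adjust the path to your deployment):

tar -czf diagnostic.data.tar.gz -C /var/lib/mongodb diagnostic.data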

I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees and are routinely deleted after some time.

Thank you for your help,
Thomas
