[SERVER-13352] Socket exception (SEND_ERROR) even after SERVER-9022 applied Created: 26/Mar/14  Updated: 12/Jan/15  Resolved: 12/Jan/15

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 2.4.9
Fix Version/s: 2.6.0

Type: Bug Priority: Critical - P2
Reporter: Alex Piggott Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-7008 socket exception [SEND_ERROR] on Mong... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
Issue Status as of Jan 09, 2015

ISSUE SUMMARY
Certain network settings and/or events may cause the connection pools used by MongoDB to be populated with "bad" or "broken" connections. Common causes include periodic network failures and firewalls silently killing long-running connections, though the actual cause is sometimes impossible to ascertain.

These connections only reveal themselves to be unusable when they are selected from the pool and data is written to them; prior to that, they appear to be healthy and usable. This is particularly relevant to large sharded clusters, which contain many connection pools (each mongos process and each shard primary has connection pools that can be affected).
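The "looks healthy until written to" behavior described above is ordinary TCP semantics rather than anything MongoDB-specific. A minimal Python sketch (illustration only, not MongoDB code) shows that a write to a connection whose peer has silently gone away can even appear to succeed once before the failure surfaces:

```python
import socket
import time

# Demonstration: a TCP connection whose peer has closed looks perfectly
# healthy until data is written to it.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
peer, _ = server.accept()
peer.close()          # simulate a firewall / remote end dropping the link
time.sleep(0.1)       # let the FIN reach the client

first_error = None
try:
    client.send(b"x")  # first write is buffered by the kernel and "succeeds"
    time.sleep(0.1)    # the closed peer answers the data with an RST
    client.send(b"x")  # now the failure finally surfaces
except OSError as e:
    first_error = e
```

The first send() only provokes an RST from the closed peer; the error then surfaces on a later write, which is exactly how a pooled "bad" connection fails only once it is drawn from the pool and used.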

USER IMPACT
When a triggering event occurs, some proportion of the idle connections in a connection pool may become unusable while still looking healthy. Over time, as the MongoDB process (mongod or mongos) attempts to use these connections from the pool, they may fail, throwing socket exceptions (SEND_ERROR, recv() timeout, etc.). These errors occur sporadically (depending on how many connections were affected and how busy the process was) until the "bad" connections in the pool are exhausted or the process in question is restarted. Essentially, this often presents as seemingly random socket exceptions long after the trigger event has occurred.

WORKAROUNDS
If a regular trigger event is suspected, preventing the event in the first place is the best solution. If that proves elusive, the only definitive remedy is to restart the impacted processes once such an event has occurred (or is suspected to have occurred) in order to clear out the problematic pools.
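One concrete way to attack the firewall trigger named above is to make idle pooled connections look non-idle with aggressive TCP keepalive, so that intermediate devices do not silently cull them. The sketch below uses illustrative values and Linux-specific option names; it is a generic example, not configuration taken from this ticket:

```python
import socket

# Enable keepalive on a client socket so an otherwise-idle pooled
# connection emits periodic probes instead of going silent.
# Values are examples; pick an interval shorter than the firewall's
# idle timeout.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)  # first probe after 120s idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # then every 30s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)     # declare dead after 4 missed probes
```

The same effect can be obtained system-wide on Linux via the net.ipv4.tcp_keepalive_* sysctls, which also cover processes (like mongod/mongos) whose sockets you cannot configure directly.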

The releaseConnectionsAfterResponse parameter (added in 2.2.4 and 2.4.2 as part of SERVER-9022) can help alleviate the issue, but does not eliminate it. Additionally, this parameter must be used judiciously and with caution, per the warning given in SERVER-9022.
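For reference, server parameters of this kind are typically enabled either at mongos startup or at runtime. The invocations below are assumptions based on the parameter name given in SERVER-9022 and the standard setParameter mechanism; verify against that ticket (and heed its warning) before use:

```shell
# At startup (hypothetical invocation; verify against SERVER-9022):
mongos --setParameter releaseConnectionsAfterResponse=true

# Or at runtime, from a mongo shell connected to the mongos:
#   db.adminCommand({ setParameter: 1, releaseConnectionsAfterResponse: true })
```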

AFFECTED VERSIONS
MongoDB versions prior to 2.6.0 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.0 production release.

RESOLUTION DETAILS
MongoDB 2.6 includes new connection pooling code, incorporating the work done in SERVER-9041 to proactively detect the re-use of broken connections from the pool.
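The proactive-detection idea can be sketched in a few lines: before handing a pooled connection out, check whether it has become readable while idle, since a TCP connection closed by the peer is readable and yields EOF. This is a generic Python illustration of the technique, assuming nothing about MongoDB's actual pool internals:

```python
import select
import socket
import time

def connection_is_usable(sock):
    """Cheap liveness probe before reusing a pooled connection.

    A healthy idle socket has nothing to read; one the peer has closed
    becomes readable, and a peeked recv() returns b"" (EOF).
    Illustration of the technique only, not MongoDB's pool code.
    """
    readable, _, errored = select.select([sock], [], [sock], 0)
    if errored:
        return False
    if readable:
        try:
            return sock.recv(1, socket.MSG_PEEK) != b""  # peek, don't consume
        except OSError:
            return False
    return True

# Tiny loopback demo: the probe passes while the peer is alive ...
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.create_connection(server.getsockname())
peer, _ = server.accept()
alive = connection_is_usable(client)

# ... and fails once the remote end goes away.
peer.close()
time.sleep(0.1)
dead = connection_is_usable(client)
```

The check costs one zero-timeout poll per checkout, so a pool can discard dead connections before any request is ever written to them.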

Original description

Like some other folks, I was encountering the issue described in SERVER-7008 (principally on a cluster with 32 mongos processes and 20 mongod processes forming 10 shards, all running 2.4.9).

The occurrences were a bit random but tended to happen in the mornings and early in the week (the latter probably correlated with the weekly compaction that runs on Saturday night).

The problem would always disappear for 1-2 weeks after a mongos restart.

After applying SERVER-9022, the problems appeared to have stopped. After ~6 weeks, however, some nodes started to see SEND_ERROR exceptions again. As before, a mongos restart fixed everything.

I confirmed that all the servers did have the patch applied (the parameter was: true).



 Comments   
Comment by Alex Piggott [ 09/Jan/15 ]

(I'm the original reporter for this; we saw these errors regularly on a large ~20-node cluster and less frequently on smaller clusters, running 2.2 and 2.4.)

Since moving to MongoDB 2.6 we have stopped seeing these problems.

Comment by sam flint [ 03/Aug/14 ]

I am seeing issues as well. This started when I added a 5th mongod to each shard of our cluster. Currently on 2.4.8 in production.

Comment by Srinivasa Kanamatha [ 29/Jul/14 ]

We are seeing similar issues in our environment. We are currently on version 2.4.5.

Here is the error

"socket exception [SEND_ERROR] for ..."

Comment by Adam Comerford [ 29/Jul/14 ]

Hi Jérémie,

I am the one responsible for writing and recording M202 and I just wanted to clarify the course content regarding connection pooling. I do indeed talk about the need to restart the mongos in certain scenarios, however that section of the course actually states that 2.6 has made it less likely that you need to restart the mongos, so I think there is something of a misunderstanding. When I speak about the need to restart I am referring to versions 2.4 (and below) because there have been specific improvements made in 2.6 around how to handle connections in the pool that have "gone bad". Therefore, the upgrade to 2.6 is definitely a recommended action for people seeing this problem, because (as mentioned in the course) the changes are too complex to backport to 2.4.

Adam

Comment by Alex Piggott [ 07/Jul/14 ]

@Thomas: sorry, I haven't had a chance to test it. (The change in package structure for 2.6 has messed up our deployment infrastructure; it's obviously not a big issue to update, but it does mean I've been waiting for someone to have some free time, versus just being able to update one of our reference instances in 5 minutes.)

Comment by Jérémie Charest [ 07/Jul/14 ]

I took the "M202: MongoDB Advanced Deployment and Operations" course recently on MongoDB University. At some point during the course, one presentation illustrated mongos connection pooling and clearly stated that under MongoDB 2.6 we have to restart the mongos if that kind of network/socket error occurs (near the 7-minute mark in the video).

https://www.youtube.com/watch?v=v8IGPu8XLKo

Will it be fixed in 2.4.x, or do we absolutely need to upgrade to 2.6? I think it might be good to mention that information in the mongos documentation.

Comment by Thomas Rueckstiess [ 07/Jul/14 ]

Hi Alex,

Have you had a chance to upgrade your system already? If so, please let us know if the issue reoccurred or if 2.6 fixed the problem for you.

Thanks,
Thomas

Comment by Randolph Tan [ 08/May/14 ]

I tried following the steps in 2.4.9 and was able to see the socket exception from the mongos, but it resolved by itself afterwards, so I am not sure we were seeing the same thing. It is still possible to see the same error in 2.6, but since we actively check the status of the connections, the occurrence of this issue is minimized.

Comment by Alex Piggott [ 08/May/14 ]

Randolph - thanks for the update. We haven't moved across to 2.6.1 yet, though it's on our TODO list (we're stockpiling empty chunks!); we'll do that in the next couple of weeks and report back six weeks later on whether the issues have stopped.

Though it seems like Jérémie may have found a simple way to reproduce, so checking whether that is fixed would be a good start.

Comment by Randolph Tan [ 08/May/14 ]

Have you tried v2.6.1? It has a fix for SERVER-9041 that should alleviate most of the bad connection cleanup issues.

Comment by Jérémie Charest [ 24/Apr/14 ]

We are experiencing the same behavior on a test shard: 2 replica sets (1 primary and 2 secondaries each). Restarting the mongos server "fixes" it. The bug SERVER-9041 and the last comment seem interesting.

Steps to reproduce:

  • stop the primary of one of the replica sets (4-dev-db-salt)
  • wait until a secondary becomes primary
    • mongos connections are working at this point
    • rs.status() is fine
  • restart 4-dev-db-salt
    • a priority was assigned to it, so it becomes primary again
    • rs.status() is fine
  • after a while, the error below is logged and the mongos becomes unusable for this replica set

Info:

  • all mongos and mongod processes are on 2.4.8
  • CentOS 6.5
  • internal VMs with near-zero latency

mongos logs:

Thu Apr 24 15:50:58.631 [conn26] warning: socket exception when initializing on repset2:repset2/4-dev-db-salt:27017, 5-dev-db-salt:27018, 6-dev-db-salt:27018, current connection state is 
 { state: 
		{ 
		  conn: "repset2/4-dev-db-salt:27017,5-dev-db-salt:27018,6-dev-db-salt:27018",
		  vinfo: "mmoTemplate.AvatarCollection @ 2|5||535803133cdb3ed5967912f4",
		  cursor: "(empty)", count: 0, done: false
		},
	retryNext: false,
	init: false,
	finish: false,
	errored: false
 }
:: caused by :: 9001 socket exception [SEND_ERROR] server [4-dev-db-salt:27017]

Comment by Alex Piggott [ 26/Mar/14 ]

Examples from log file when it occurs:

mongos.log.2014-03-21T13-23-45-Fri Mar 21 10:00:27.238 [conn228999] Socket say send() errno:110 Connection timed out 10.45.142.86:27025
mongos.log.2014-03-21T13-23-45:Fri Mar 21 10:00:27.251 [conn228999] warning: socket exception when initializing on replica_set8:replica_set8/10.45.142.211:27025,10.45.142.86:27025, current connection state is { state: { conn: "replica_set8/10.45.142.211:27025,10.45.142.86:27025", vinfo: "doc_metadata.metadata @ 1869|3||528edf5f3a901fb6bb300289", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 9001 socket exception [SEND_ERROR] server [10.45.142.86:27025]
 
mongos.log.2014-03-21T13-23-45-Fri Mar 21 11:00:14.873 [conn228999] Socket say send() errno:110 Connection timed out 10.170.15.12:27020
mongos.log.2014-03-21T13-23-45-Fri Mar 21 11:00:14.874 [conn228999] Socket say send() errno:110 Connection timed out 10.33.131.5:27022
mongos.log.2014-03-21T13-23-45-Fri Mar 21 11:00:14.876 [conn228999] Socket say send() errno:110 Connection timed out 10.45.142.86:27025
mongos.log.2014-03-21T13-23-45:Fri Mar 21 11:00:14.876 [conn228999] warning: socket exception when initializing on replica_set8:replica_set8/10.45.142.211:27025,10.45.142.86:27025, current connection state is { state: { conn: "replica_set8/10.45.142.211:27025,10.45.142.86:27025", vinfo: "doc_metadata.metadata @ 1869|3||528edf5f3a901fb6bb300289", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 9001 socket exception [SEND_ERROR] server [10.45.142.86:27025]

Interesting - I hadn't noticed the timeouts when the problem occurred before the SERVER-9022 fix. FWIW, there is very little chance the IPs listed are not responding.

(EDIT: in case it's causing confusion, note that the mongod instances live on port (27017 + shardid), where in this case shardid = 1...10; mongos lives on 27017, and the configdb lives on port 27016)

EDIT2: note the significance of the timestamps is that there's an hourly test that runs which calls, amongst other things, db.collection.count(), which seems to be the call that fails most frequently. Note also that it's the same (stale?) connection being used in both cases.

Generated at Thu Feb 08 03:31:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.