[SERVER-7008] socket exception [SEND_ERROR] on Mongo Sharding Created: 11/Sep/12  Updated: 11/Jul/16  Resolved: 19/Mar/14

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Enrico Hofmann Assignee: Randolph Tan
Resolution: Fixed Votes: 20
Labels: None
Environment:

Debian Squeeze / Mongo 2.2.0


Issue Links:
Depends
depends on SERVER-9041 proactively detect broken connections... Closed
Related
related to SERVER-13352 Socket exception (SEND_ERROR) even af... Closed
related to SERVER-9022 Enable mongos mode which releases Sha... Closed
Operating System: ALL
Participants:

 Description   

Hi, I sometimes get the following error, and it blocks the whole system. After I restart the mongod service it works well again. This is not the first time I have hit this error.

uncaught exception: getlasterror failed: {
	"shards" : [
		"set01/mongo07.luan.local:10011,mongo08.luan.local:10011"
	],
	"ok" : 0,
	"errmsg" : "could not get last error from a shard set01/mongo07.luan.local:10011,mongo08.luan.local:10011 :: caused by :: socket exception [SEND_ERROR] for 172.16.18.7:10011"
}



 Comments   
Comment by Kevin Matulef [ 13/Sep/12 ]

Hi Enrico,

Did you get the message above from mongos, or from one of the mongod shards? If it happens again, do you mind uploading the logs from the machine it happens on, and also the logs from mongo07.luan.local and mongo08.luan.local?

This message indicates that it's having trouble contacting one of the shards, which could happen if it has a connection that's gone stale for some reason. If this is a mongos, you can force it to refresh its connections by typing:

> db.adminCommand( { flushRouterConfig : 1 } )

or if it's a mongod you can type

> db.adminCommand( { connPoolSync : 1 } )
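
For anyone driving this from a script rather than the shell, the same two commands can also be issued through pymongo; a minimal sketch (the host and port below are placeholders):

from pymongo import MongoClient

# Placeholder address: point this at the mongos (or mongod) you want to refresh.
client = MongoClient("mongos.example.net", 27017)

# On a mongos: force it to refresh its cached routing config and connections.
client.admin.command("flushRouterConfig", 1)

# On a mongod: sync its outgoing connection pools.
client.admin.command("connPoolSync", 1)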

Comment by larry.loi [ 19/Oct/12 ]

I got the same problem, but the two commands above were unable to solve it.

Comment by Anton V. Volokhov [ 23/Oct/12 ]

Same problem here, when trying to access a non-partitioned database through mongos.

Comment by Joonas Reynders [ 18/Dec/12 ]

Same problem observed here. Our setup is 2 shards, each shard a replica set of 3 nodes, 1 mongos, and 1 config server (which runs on the same machine as one of the shards' primaries; this is a dev setup, by the way). When restarting each node one at a time, mongos starts to report this error. To be more exact, sh.status() does not report a problem, but when trying to query anything we get SEND_ERROR (from the same replica set node that also houses the config server). After retrying the query repeatedly, everything suddenly starts to work again, maybe 10 minutes or so after the node restart cycle.

Running mongo 2.2.1

Comment by Reid Morrison [ 04/Jan/13 ]

We are running into a similar situation in production with MongoDB V2.2.2

It appears that with V2.2 the behavior in the Mongo Router process has changed from V2.0.6. In V2.0 when the Router experienced a network failure talking to a remote database server (shard master), it would return a failure message to the client and release the bad/failed connection. As of V2.2 the router process appears to be holding onto these failed connections and keeps re-using the bad connection resulting in the same error message on subsequent calls from the client to the router process.

Can we change the mongo router to close/release these bad connections and not keep them around?

Below is the error message we keep getting back from the mongo router process (the IP address is for the master in one of the shards) :
9001: socket exception [SEND_ERROR] for x.x.x.x:27018

Restarting the mongo router process fixes the issue. That is until a network failure occurs again and we are back to the constant send error above until we manually restart the router process.

There is no information in the Mongo Router process log files about this issue. Everything is normal in the log file. The only interesting point is that the number of connections remains steady at 44, but these are probably just the client side connections, not the connections to the Mongo database servers.

Any help with this issue will be greatly appreciated.

Comment by Randolph Tan [ 04/Jan/13 ]

Interesting; the mongos router process has been using pooled connections since before v2.0. We made a fix in SERVER-4706 for 2.3.2 that could help mitigate the issue, but it currently only works for connections to standalone servers (not replica sets) and for new incoming client connections to mongos.

Comment by Reid Morrison [ 04/Jan/13 ]

I can confirm that running the command below on the mongo router (mongos) does not help; a full restart of mongos is required:

db.adminCommand( { flushRouterConfig : 1 } )

Comment by Reid Morrison [ 04/Jan/13 ]

We are definitely seeing a dramatic difference since upgrading production from V2.0.6 to V2.2.2. It only affects the mongos processes in our second (remote) data center that perform writes back to the shard masters in the first data center. Anytime there is a network issue between the 2 data centers all the mongos processes in the remote data center have to be restarted due to the send errors being sent back to the clients.

We run one mongos process on every web and batch processing server to reduce network latency and to improve redundancy.

Comment by James Blackburn [ 23/Jan/13 ]

We get a whole load of these, often in the morning, which suggests the sockets are timing out overnight and the failure is not being noticed. Currently I'm logging all OperationFailures, and they look like this:

  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/collection.py", line 598, in find_one
    for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 814, in next
    if len(self.__data) or self._refresh():
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 763, in _refresh
    self.__uuid_subtype))
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 720, in __send_message
    self.__uuid_subtype)
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 104, in _unpack_response
    error_object["$err"])
OperationFailure: database error: socket exception [SEND_ERROR] for 10.220.54.55:27118

Retrying seems to resolve the problem (I only see them logged once for a given client). It's a shame that they're not represented as AutoReconnect errors, as it means we have to put AutoReconnect and OperationFailure decorators around every single pymongo call.
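
For reference, a rough sketch of the kind of wrapper we end up putting around every call (retry count, delay and the string match are arbitrary choices for the sketch, not anything from the driver):

import time
from functools import wraps

from pymongo.errors import AutoReconnect, OperationFailure


def retry_stale_mongos(retries=5, delay=1.0):
    # Retry a pymongo call when mongos hands back a stale pooled connection.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(retries):
                try:
                    return func(*args, **kwargs)
                except AutoReconnect:
                    pass  # transient; retry
                except OperationFailure as e:
                    # mongos reports its stale connection as a plain
                    # OperationFailure, so we have to inspect the message.
                    if "socket exception" not in str(e):
                        raise
                time.sleep(delay)
            return func(*args, **kwargs)  # final attempt, let errors propagate
        return wrapper
    return decorator


@retry_stale_mongos()
def find_one(collection, spec):
    return collection.find_one(spec)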

Comment by Greg Studer [ 28/Jan/13 ]

Looking more into this. One potential fix/workaround may be keepalive - if it's not set to a low value already, does setting keepalive mitigate the problem?

http://docs.mongodb.org/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-sharded-clusters-and-replica-sets
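
As a quick way to check where a given host currently stands, a small sketch (Linux only; the 300-second figure matches what commenters here report using, not a hard recommendation):

# Report the kernel's TCP keepalive idle time and suggest lowering it if needed.
KEEPALIVE_PROC = "/proc/sys/net/ipv4/tcp_keepalive_time"
TARGET_SECONDS = 300  # value reported by commenters below; adjust to taste

with open(KEEPALIVE_PROC) as f:
    current = int(f.read().strip())

print("tcp_keepalive_time = %d seconds" % current)
if current > TARGET_SECONDS:
    print("consider: sysctl -w net.ipv4.tcp_keepalive_time=%d" % TARGET_SECONDS)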

If anyone has a sample of the logs from mongos during the affected period (from the start of mongos, to include startup parameters), ideally at log level 3 (can be set at runtime), it would be useful to see exactly the timings of what is happening here on a per-conn basis. If they contain sensitive info, opening a SUPPORT ticket (which is user-private) linked to this ticket is an option as well.

Comment by Randolph Tan [ 28/Jan/13 ]

Reid Morrison & James Blackburn: Are the shards replica sets or standalone servers? If the former, what version of the driver are you using, and are you using slaveOk/read preferences?

Thanks!

Comment by James Blackburn [ 28/Jan/13 ]

The shards are replica sets. We're using pymongo 2.4.1, and the clients are coming in with a read preference set to primary only.

One thing which is (very) non-standard here is that we're LD_PRELOAD'ing libsdp to run TCP/IP over InfiniBand. It looks like this doesn't honour the TCP keepalive request; I see this in the mongos log files:

Mon Jan 28 08:27:24 [conn129] Socket say send() errno:104 Connection reset by peer 10.220.54.17:27118
Mon Jan 28 08:27:24 [conn129] DBException in process: socket exception [SEND_ERROR] for 10.220.54.17:27118
Mon Jan 28 08:27:24 [conn129] ERROR: can't get TCP_KEEPINTVL: errno:92 Protocol not available

Other than these periodic OperationFailures, Mongo over infiniband seems to work very well indeed.

Comment by Reid Morrison [ 28/Jan/13 ]

We are running 5 shards, each of which is a replica set. We use the Ruby and Go drivers, so it is not likely to be related to the driver in any way.

To reproduce the problem:
1. Set up a sharded cluster of X replica-set shards
2. Run the mongos instance on your local machine
3. Perform a query from a test program which goes through the local mongos
4. Break the network connection between the master and mongos
5. Perform another query which should fail with a connection lost OperationFailure
6. Re-establish the network connection between mongos and the master
7. Perform another query which should fail with a connection lost OperationFailure

Step 7 may have to be repeated several times: the connections between mongos and the master live in a connection pool inside mongos, so there is no guarantee that the bad connection will be used on the first attempt.

Once mongos is in this state the only way to fix it is to restart the mongos process. It is not necessary to restart the client.

The longer mongos is left running, the greater the number of these bad connections in its connection pool, increasing the likelihood of getting a bad connection when a call is made via mongos. A rough probe of this is sketched below.
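
A rough sketch of a probe for steps 5-7 (host, database and collection names are placeholders); it simply repeats a query through the local mongos and reports which attempts hit a bad pooled connection:

import time

from pymongo import MongoClient
from pymongo.errors import AutoReconnect, OperationFailure

# Placeholders: point this at the local mongos and any collection behind it.
client = MongoClient("localhost", 27017)
coll = client["test"]["probe"]

for attempt in range(50):
    try:
        coll.find_one()
        print("attempt %2d: ok" % attempt)
    except (AutoReconnect, OperationFailure) as e:
        # A SEND_ERROR here means mongos handed out a dead pooled connection.
        print("attempt %2d: %s" % (attempt, e))
    time.sleep(1)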

Comment by Reid Morrison [ 28/Jan/13 ]

Yes, we are using the slaveOk read preference. The network connection issues are generally to the replica-set masters since they are on a different network segment.

It is possible that the network connections to the master are going "stale" since all reads are to the slaves and writes are seldom performed.

Comment by Reid Morrison [ 28/Jan/13 ]

I confirmed that on all of our Red Hat Linux servers the TCP keep-alive time is set to 300.

Comment by DisU [ 04/Feb/13 ]

Any news on this? We've been experiencing the same issue in the last few months.

Comment by James Blackburn [ 01/Mar/13 ]

I've written a multi-process batch job. What's interesting is that when this error occurs against the local mongos, it seems to occur consistently for one PID on the local machine at a time. Trying to get it to happen reliably is difficult, and it randomly fixes itself. But for the duration (order of minutes) it appears that the given mongo connection doesn't work.

Comment by James Blackburn [ 14/Mar/13 ]

We've been trialling Mongo here, and this is the bug that hurts us most.

It's difficult to work around too, as it's unclear how many stale connections the mongos has, especially in the morning after a night without use. So we end up retrying 5 times (around every mongo operation), and sometimes that isn't enough.

Comment by Matt Bailey [ 18/Mar/13 ]

Is there anything we can do to mitigate this issue, or are we really going to have to wait for 2.6 for this to be addressed?

Comment by DisU [ 18/Mar/13 ]

I completely agree that this issue needs to be addressed, even as part of 2.2.4. Retrying five times is not an acceptable workaround. We're considering migrating away from Mongo because of this and similar issues.

Comment by Adam Neumann [ 18/Mar/13 ]

@DisU seconded. We use Mongo because of its easy clustering. If this doesn't work reliably, it is a real concern for us and for the service we provide to our customers!

Comment by Luca Bonmassar [ 18/Mar/13 ]

We've been advocating Mongo for a long time, and this is the first time a bug has hit us so badly in production and for such a long time. The answer of disabling sharding or waiting till 2.6 (two major versions away!) is simply a joke. We have built scripts and monitors to detect when this happens and restart faulty mongo routers (restarting everything feels very Windows 95, by the way). We were using Mongo for a high-performance cluster, and we're down basically every other hour.

We have been early adopters and supporters while a lot of negative criticism was depicting Mongo as a toy.
Today we really feel you have let us down, and we've initiated our plan to ditch Mongo for other data storage. This is not only hurting our business, but also your business partners, like MongoHQ, who have been great to us. It's a sad day.

Comment by Eliot Horowitz [ 19/Mar/13 ]

Definitely don't have to wait for 2.6.

We're working with one user on a patch that seems to help a lot in 2.2.

I think the problem is that there are a lot of potential causes, some networking, some system, so it's been very hard to get a clear picture of what's happening.

Should have a patch for 2.2.4 or 2.2.5 that in most cases should greatly reduce this issue, or at least rule out large buckets of potential issues.

Comment by Eliot Horowitz [ 19/Mar/13 ]

There are 2 other changes in 2.2.4 that might be related as well in some cases: SERVER-8786 and SERVER-8648

Comment by Matt Bailey [ 19/Mar/13 ]

@eliot Excellent, really appreciate the update from 10gen, seemed pretty quiet on this ticket.

Comment by Eliot Horowitz [ 19/Mar/13 ]

matt - yes, much too quiet. There was a lot of internal communication about it, so we didn't realize how little of it had made it out publicly.

Comment by Geordios Santos [ 02/Apr/13 ]

Yes, it has been very quiet and still is. Is there any update on this?

Comment by Randolph Tan [ 02/Apr/13 ]

Hi,

We are still working on this. We are currently experimenting with a fix, explained in the linked ticket, which is planned for 2.5.

Comment by Eliot Horowitz [ 02/Apr/13 ]

For 2.2 and 2.4, there is a new option in SERVER-9022.
When 2.2.4 comes out (this week), it would be great if some people could help test it.
It's off by default for reasons described in that ticket.

Comment by Reid Morrison [ 02/Apr/13 ]

Thank you, we will give it a try as soon as it is available.

The problem is not so much that it is not returning a connection to the pool. The problem is that it is returning bad/failed connections to the pool. When a socket error occurs on a connection, that connection should be removed from the pool and destroyed. What we are seeing is that bad connections in the pool are being re-used, resulting in constant send failures in mongos for that bad connection.
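
To illustrate the policy being asked for (this is only a sketch of the general idea, not mongos code): a connection that has just thrown a socket error is closed and dropped, never returned to the pool.

import socket


def run_with_pool(pool, operation):
    conn = pool.get()
    try:
        result = operation(conn)
    except socket.error:
        conn.close()   # destroy the failed connection...
        raise          # ...surface the error, and never pool.put(conn)
    pool.put(conn)     # only healthy connections go back into the pool
    return result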

Comment by James Blackburn [ 02/Apr/13 ]

Indeed, I don't think SERVER-9022 will help.

My experience is that idle sockets collect in the global pool. When they are next used, they error. It would be great if sockets in the global pool expired after a fixed timeout. That would prevent the accumulation of stale sockets and the OperationFailure exceptions we see every morning.
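
Something along these lines is what I have in mind; purely an illustration, not mongos code, and the 5-minute figure is arbitrary:

import time
from collections import deque

IDLE_TIMEOUT_SECONDS = 300  # arbitrary cut-off for this sketch


class IdleExpiringPool(object):
    # Illustration only: never hand back a socket that has sat idle too long.

    def __init__(self, connect):
        self._connect = connect   # factory that opens a fresh connection
        self._idle = deque()      # (connection, returned_at) pairs

    def get(self):
        now = time.time()
        while self._idle:
            conn, returned_at = self._idle.popleft()
            if now - returned_at <= IDLE_TIMEOUT_SECONDS:
                return conn
            conn.close()          # too stale: assume the other end gave up on it
        return self._connect()

    def put(self, conn):
        self._idle.append((conn, time.time()))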

Comment by Eliot Horowitz [ 02/Apr/13 ]

James - why do you think they are coming out of the global pool?

Comment by James Blackburn [ 03/Apr/13 ]

I'd assumed this because the errors often seem to be clustered in the morning:

list(c.cube.ahl_mongo_events.find({'d.error': {'$regex': '.* socket exception .*'}}, fields={'t':1, '_id':0}))
...
 {u't': datetime.datetime(2013, 3, 13, 7, 56, 33, 957000)},
 {u't': datetime.datetime(2013, 3, 13, 7, 56, 34, 28000)},
 {u't': datetime.datetime(2013, 3, 13, 7, 56, 34, 252000)},
 {u't': datetime.datetime(2013, 3, 13, 8, 21, 10, 997000)},
 {u't': datetime.datetime(2013, 3, 13, 8, 49, 11, 866000)},
 {u't': datetime.datetime(2013, 3, 13, 9, 28, 15, 914000)},
 {u't': datetime.datetime(2013, 3, 13, 10, 3, 10, 779000)},
 {u't': datetime.datetime(2013, 3, 13, 10, 3, 10, 812000)},
 {u't': datetime.datetime(2013, 3, 14, 8, 30, 19, 850000)},
 {u't': datetime.datetime(2013, 3, 14, 8, 30, 19, 987000)},
 {u't': datetime.datetime(2013, 3, 14, 9, 53, 17, 560000)},
 {u't': datetime.datetime(2013, 3, 14, 9, 53, 17, 628000)},
 {u't': datetime.datetime(2013, 3, 14, 9, 53, 17, 666000)},
 {u't': datetime.datetime(2013, 3, 14, 9, 54, 2, 400000)},
 {u't': datetime.datetime(2013, 3, 14, 9, 54, 2, 432000)},
 {u't': datetime.datetime(2013, 3, 14, 10, 30, 51, 442000)},
 {u't': datetime.datetime(2013, 3, 14, 11, 20, 49, 533000)},
 {u't': datetime.datetime(2013, 3, 14, 14, 48, 44, 212000)},
...

Perhaps the assumption is wrong, as the exceptions do happen in the afternoon too.

Digging into the tracebacks, they nearly always occur in a cursor 'refresh', which I hadn't noticed before:

c.cube.ahl_mongo_events.find({'d.error': {'$regex': '.* socket exception .*'}}).distinct('d.traceback')
...
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 814, in next
    if len(self.__data) or self._refresh():
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 763, in _refresh
    self.__uuid_subtype))
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 720, in __send_message
    self.__uuid_subtype)
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.2-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 105, in _unpack_response
    error_object["$err"])
OperationFailure: database error: socket exception [SEND_ERROR] for 10.220.54.51:27118
...
 
len([x for x in c.cube.ahl_mongo_events.find({'d.error': {'$regex': '.* socket exception .*'}}) if '_refresh' in x['d']['traceback']])
Out[81]: 446
len([x for x in c.cube.ahl_mongo_events.find({'d.error': {'$regex': '.* socket exception .*'}}) if '_refresh' not in x['d']['traceback']])
Out[82]: 25
 
Those 25 are errors in waiting for command response (_check_command_response in the Python):
 
[x['d']['error'] for x in c.cube.ahl_mongo_events.find({'d.error': {'$regex': '.* socket exception .*'}}) if '_refresh' not in x['d']['traceback']]
Out[92]: 
[u"command SON([('findAndModify', u'EOD.version_nums'), ....]) failed: exception: socket exception [SEND_ERROR] for 10.220.54.14:27118",
 u'command SON([(\'count\', u\'EOD\'), (\'fields\', None), ...)]) failed: exception: socket exception [CONNECT_ERROR] for rs1/cn54-ib:27118,cn55-ib:27118',
... ]
 

Comment by Eliot Horowitz [ 03/Apr/13 ]

James - I think the evidence is inconclusive at best.

2.2.4 is out, so assuming the caveats listed in SERVER-9022 do not apply, I would encourage you to try that.

Comment by Andrey Godin [ 08/Sep/13 ]

Hi!
Has anybody tried the option in SERVER-9022, and does it work?

This bug makes it impossible to develop our project in conjunction with MongoDB.

Comment by Randolph Tan [ 12/Sep/13 ]

Hi,

We have also made changes in 2.5 (SERVER-9041) that clear out bad connections sooner in mongos.

Thanks!

Comment by Eyllo [ 25/Oct/13 ]

Hi there,

We are also hitting this problem using 2.4.7 inside Microsoft Azure, with all of our servers in the same affinity group and region. Any idea when this will be fixed, or how it can be worked around?

Comment by Andrey Aleksandrov [ 16/Nov/13 ]

Hi

The same problem

MongoS version 2.4.8 starting: pid=14289 port=27017 64-bit host=second-w3.x.xx.xx (--help for usage)
git version: a350fc38922fbda2cec8d5dd842237b904eafc14
build sys info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bottle.py", line 764, in _handle
    return route.call(**args)
  File "/usr/lib/python2.6/site-packages/bottle.py", line 1575, in wrapper
    rv = callback(*a, **ka)
  File "/var/www/desk/mobile.py", line 526, in article
    content = self.store.xx_ru.Name.find_one({"_id": int(article_id)})
  File "/usr/lib64/python2.6/site-packages/pymongo/collection.py", line 604, in find_one
    for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
  File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 904, in next
    if len(self.__data) or self._refresh():
  File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 848, in _refresh
    self.__uuid_subtype))
  File "/usr/lib64/python2.6/site-packages/pymongo/cursor.py", line 800, in __send_message
    self.__uuid_subtype)
  File "/usr/lib64/python2.6/site-packages/pymongo/helpers.py", line 100, in _unpack_response
    error_object["$err"])
OperationFailure: database error: socket exception [SEND_ERROR] for 10.0.xx.xx:27117

It is a matter of days now: we need to get this system into production.

Comment by bg [ 10/Dec/13 ]

Getting the same problem with a 2.4.5 cluster. Started happening after some of the members were rebooted. Does not seem to be clearing itself, but can't find anything wrong...

Comment by Jesse Cai [ 21/Jan/14 ]

Confirmed this issue still exists in 2.4.6; it happens after some of the members are rebooted. Things go back to normal after a full restart of mongos.

Comment by Alex Piggott [ 21/Jan/14 ]

For what it's worth, I've been getting this quite a lot (on non-trivial "count()" commands only) since I built a 28-node cluster (14 shards x 2 replicas). My previous largest cluster was 4 nodes (2 shards x 2 replicas), and I only saw this error very intermittently there (not in bursts, just the odd one from time to time).

Monday morning every single mongos (on 14 separate application nodes) was in this errored state. Restarting each mongos fixed it on that mongos - but in addition just typing "use <database>", "db.<collection>.count()" in the local shell also appeared to fix it (for the <database> and <collection> in question).

Thought I'd mention these data points in case they help diagnose the problem, and in addition if the simple "count()" on the collection really fixes it (in some cases?!), then that might be a workaround for some people pending a full fix.

Has anyone actually tried SERVER-9022? The problem only occurs on my operational clusters, so I don't really want to mess about with a new mongos mode unless there's a good reason.

Comment by Randolph Tan [ 21/Jan/14 ]

I believe it was a coincidence that calling count fixed the problem. This issue will eventually resolve itself once all the bad connections are replaced with new ones. Prior to SERVER-9041, mongos only finds out that a connection is bad/closed after using it and getting an error.
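
For readers wondering what "proactively detect" means in practice, here is a generic illustration of one way to notice a half-closed TCP connection before reusing it (a non-blocking peek). This is not the actual mongos/SERVER-9041 implementation, just the general idea:

import errno
import socket


def looks_alive(sock):
    # Peek at the socket without blocking or consuming data.
    try:
        sock.setblocking(False)
        data = sock.recv(1, socket.MSG_PEEK)
        return data != b""    # b"" means the peer closed its end
    except socket.error as e:
        # Nothing to read yet: the connection still looks healthy.
        return e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
    finally:
        sock.setblocking(True)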

Comment by Alex Piggott [ 21/Jan/14 ]

It would be odd if it was a complete coincidence given how consistent the behaviour was:

  • I had a GUI that calls find(XXX).count() via the Java driver when I press a button. I pressed the button several times, and each time I got an error.
  • I then entered the shell via ssh, ran the count command (which always worked), and then pressed the button again several times, and now each time it worked.

I did this on a few different servers.

Is it possible that logging into the mongos via the shell (or calling the "use" command, or calling the "count" command) resets the connections?

Incidentally will "db.adminCommand(

{ flushRouterConfig : 1 }

)" break any commands that are ongoing when I run it?

I'd like to run it regularly until we can move to a version that has SERVER-9041 (unless that would cause performance issues, e.g. if I run it every 10 minutes?) and see if that prevents the problem from occurring.

Comment by Alex Piggott [ 27/Jan/14 ]

It just happened again, and this time running a command from the shell did not fix it (there must just have been a cluster of working servers that the round robin happened to hit when I was trying that before).

I also tried the "db.adminCommand(

{ flushRouterConfig : 1 }

)" and (as with one of the other commenters) that also definitely did not work

Restarting the mongos (and then all of the follow on processes connected via a Mongo object) did work, as before.

It would be really handy if someone from 10gen could confirm whether SERVER-9022 will work given that flushRouterConfig doesn't. If it is believed to work, I will try it out and report back; otherwise I suppose I'll try restarting at 2am every few days, in case it's a "leak"-type problem rather than a "race"-type problem.

Comment by Dan Pasette [ 28/Jan/14 ]

SERVER-9022 will mitigate the issue under most circumstances, but not completely fix it. If your application conforms to the caveats described in SERVER-9022 (most apps should), then I encourage you to turn on releaseConnectionsAfterResponse on a single mongos and monitor the difference. This setting will effectively be on by default in 2.6 (and the option will be deprecated in 2.6).
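
For anyone trialling this, a sketch of flipping the parameter on a single mongos from pymongo (assuming the parameter is accepted at runtime via setParameter; if not, it can be supplied as --setParameter releaseConnectionsAfterResponse=true when starting that mongos):

from pymongo import MongoClient

# Placeholder address: connect directly to the one mongos you want to trial.
mongos = MongoClient("mongos.example.net", 27017)

# Turn the option on for this router only, then read it back to confirm.
mongos.admin.command("setParameter", 1, releaseConnectionsAfterResponse=True)
print(mongos.admin.command("getParameter", 1, releaseConnectionsAfterResponse=1))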

Comment by Alex Piggott [ 26/Feb/14 ]

In case anyone was wondering, since I applied SERVER-9022 to my servers this problem has completely disappeared and I have seen no other ill effects. Thanks!

Comment by Alex Piggott [ 24/Mar/14 ]

Hmm, I spoke too soon. I added this fix at the start of February, and from early last week (i.e. mid-March) the same problem started recurring, across more and more nodes (32 nodes running mongos 2.4.9, 20 of them running mongod in 10 shards) over a few days, as before.

As before, a mass mongos restart fixed everything.

It looks like you missed a less frequent case of the same sort of problem (before the fix, it happened every 1-2 weeks)?

Should this issue be reopened, or SERVER-9022, or should I create a new issue?

Comment by Eliot Horowitz [ 24/Mar/14 ]

Alex - can you open a new ticket (and link here) just to make sure we don't confuse two issues?

Comment by Alex Piggott [ 26/Mar/14 ]

Created SERVER-13352

Comment by Vishal Katikineni [ 29/Jul/14 ]

We have a similar issue too even after setting the parameter releaseConnectionsAfterResponse=true.
