[SERVER-7008] socket exception [SEND_ERROR] on Mongo Sharding Created: 11/Sep/12 Updated: 06/Apr/23 Resolved: 19/Mar/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Sharding |
| Affects Version/s: | 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Enrico Hofmann | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 20 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Debian Squeeze / Mongo 2.2.0 |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | Alex Piggott, Andrey Aleksandrov, Andrey Godin, Anton V. Volokhov, Daniel Pasette, DisU, Eliot Horowitz, Enrico Hofmann, Eyllo, Geordios Santos, Greg Studer, James Blackburn, Jesse Cai, Joonas Reynders, Kevin Matulef, larry.loi, Luca Bonmassar, Matt Bailey, noizwaves, Randolph Tan, Reid Morrison, Scott Lowe, Vishal Katikineni |
| Description |
Hi, I sometimes get the following error, and it blocks the whole system. After I restart the mongod service it works fine again. This was not the first time I got this error.
| Comments |
| Comment by Vishal Katikineni [ 29/Jul/14 ] |
We have a similar issue too, even after setting the parameter releaseConnectionsAfterResponse=true.
| Comment by Alex Piggott [ 26/Mar/14 ] |
Created
| Comment by Eliot Horowitz (Inactive) [ 24/Mar/14 ] |
Alex - can you open a new ticket (and link here) just to make sure we don't confuse two issues?
| Comment by Alex Piggott [ 24/Mar/14 ] |
Hmm, I spoke too soon. I added this fix at the start of February, and from early last week (i.e. mid-March) the same problem started recurring over a few days, across more and more nodes (32 nodes running mongos 2.4.9, 20 of them running mongod in 10 shards), as before. As before, a mass mongos restart fixed everything. It looks like the fix missed a less frequent case of the same sort of problem (before the fix, it happened every 1-2 weeks)? Should this issue be reopened, or should I open a new one?
| Comment by Alex Piggott [ 26/Feb/14 ] |
In case anyone was wondering: since I applied the fix at the start of February, the problem has not recurred.
| Comment by Daniel Pasette (Inactive) [ 28/Jan/14 ] |
| Comment by Alex Piggott [ 27/Jan/14 ] |
It just happened again, and this time running a command from the shell did not fix it (there must just have been a cluster of working servers that the round robin happened to hit when I was trying that). I also tried "db.adminCommand( { flushRouterConfig : 1 } )" and (as with one of the other commenters) that also definitely did not work. Restarting the mongos (and then all of the follow-on processes connected via a Mongo object) did work, as before. It would be really handy if someone from 10gen could confirm whether
| Comment by Alex Piggott [ 21/Jan/14 ] |
It would be odd if it was a complete coincidence, given how consistent the behaviour was:
I did this on a few different servers. Is it possible that logging into the mongos via the shell (or calling the "use" command, or calling the "count" command) resets the connections? Incidentally, will "db.adminCommand( { flushRouterConfig : 1 } )" break any commands that are ongoing when I run it? I'd like to run it regularly until we can move to a version that has a
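For anyone wanting to script the flushRouterConfig attempt discussed above, here is a minimal sketch using pymongo (the host name is a placeholder; any 2.x-era driver version should work):

```python
from pymongo import MongoClient
from pymongo.errors import OperationFailure

# "mongos-host" is a placeholder; point this at the mongos in question.
client = MongoClient("mongos-host", 27017)

try:
    # flushRouterConfig clears the mongos's cached routing metadata.
    # As several commenters report, it does NOT appear to recycle pooled
    # shard connections, which is why it often fails to clear this error.
    client.admin.command("flushRouterConfig")
    print("router config flushed")
except OperationFailure as exc:
    print("flush failed: %s" % exc)
```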
| Comment by Randolph Tan [ 21/Jan/14 ] |
I believe that it was a coincidence that calling count fixed the problem. This issue would eventually resolve itself once all the bad connections are replaced with new ones. Prior to
| Comment by Alex Piggott [ 21/Jan/14 ] |
For what it's worth - I've been getting this quite a lot (on non-trivial "count()" commands only) since I built a 28-node cluster (14 shards x 2 replicas). My previous largest cluster was 4 nodes (2 shards x 2 replicas), and I only saw this error very intermittently (not in clusters, just the odd one-off from time to time). Monday morning every single mongos (on 14 separate application nodes) was in this errored state. Restarting each mongos fixed it on that mongos - but in addition, just typing "use <database>" and "db.<collection>.count()" in the local shell also appeared to fix it (for the <database> and <collection> in question). Thought I'd mention these data points in case they help diagnose the problem; in addition, if the simple "count()" on the collection really fixes it (in some cases?!), then that might be a workaround for some people pending a full fix. Has anyone actually tried
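If the count() observation holds, the workaround can be scripted across every mongos. A sketch, assuming a 2.x-era pymongo (where database_names(), collection_names(), and Collection.count() all exist) and placeholder host names; whether this reliably refreshes the bad pooled connections is exactly the open question in this thread:

```python
from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Placeholder list of mongos instances; one per application node here.
MONGOS_HOSTS = ["app-node-1:27017", "app-node-2:27017"]

for host in MONGOS_HOSTS:
    client = MongoClient(host)
    for db_name in client.database_names():        # pymongo 2.x API
        db = client[db_name]
        for coll_name in db.collection_names():    # pymongo 2.x API
            try:
                # A count routed through mongos is what appeared to
                # reset things in the observation above.
                db[coll_name].count()
            except OperationFailure as exc:
                print("%s %s.%s: %s" % (host, db_name, coll_name, exc))
    client.close()
```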
| Comment by Jesse Cai [ 21/Jan/14 ] |
Confirmed this issue still exists in 2.4.6; it happened after some of the members were rebooted. Things go back to normal after a full restart of mongos.
| Comment by Scott Lowe [ 10/Dec/13 ] |
Getting the same problem with a 2.4.5 cluster. It started happening after some of the members were rebooted. It does not seem to be clearing itself, but I can't find anything wrong...
| Comment by Andrey Aleksandrov [ 16/Nov/13 ] |
Hi. The same problem here with mongos version 2.4.8: starting: pid=14289 port=27017 64-bit host=second-w3.x.xx.xx (--help for usage) Traceback (most recent call last): The downtime already runs to days, and we need this system in production.
| Comment by Eyllo [ 25/Oct/13 ] |
Hi there, we are also hitting this problem using 2.4.7 inside Microsoft Azure, with all of our servers within the same affinity group and region. Any idea when this will be fixed, or how it can be worked around?
| Comment by Randolph Tan [ 12/Sep/13 ] |
Hi, we also made some changes in 2.5 that clear the bad connections sooner in mongos. Thanks!
| Comment by Andrey Godin [ 08/Sep/13 ] |
Hi! This bug makes it impossible to develop the project in conjunction with MongoDB.
| Comment by Eliot Horowitz (Inactive) [ 03/Apr/13 ] |
James - I think the evidence is inconclusive at best. 2.2.4 is out, so assuming the caveats listed in
| Comment by James Blackburn [ 03/Apr/13 ] |
I'd assumed this because the errors often seem to be clustered in the morning:
Perhaps the assumption is wrong, as the exceptions do happen in the afternoon too. Digging into the tracebacks, they nearly always occur in a cursor 'refresh', which I hadn't noticed before:
| Comment by Eliot Horowitz (Inactive) [ 02/Apr/13 ] |
James - why do you think they are coming out of the global pool?
| Comment by James Blackburn [ 02/Apr/13 ] |
Indeed, I don't think SERVER-9022 will help. My experience is that idle sockets collect in the global pool; when they are next used, they error. It would be great if sockets in the global pool expired after a fixed timeout. This would prevent the collection of stale sockets and the OperationFailure exceptions we see every morning.
| Comment by Reid Morrison [ 02/Apr/13 ] |
Thank you, we will give it a try as soon as it is available. The problem is not so much that a connection is not returned to the pool; the problem is that bad/failed connections are returned to the pool. When a socket error occurs on a connection, it should be removed from the pool and destroyed. What we are seeing is that bad connections in the pool are re-used, resulting in constant send failures in mongos for that bad connection.
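The pool policy Reid and James are asking for - destroy a connection on socket error, and expire sockets that have sat idle past a cutoff - can be sketched in a few lines. This is an illustration of the proposed behaviour in Python, not MongoDB's actual pool code:

```python
import socket
import time

class ExpiringPool(object):
    """Toy connection pool: drops idle-expired sockets, and destroys
    errored sockets instead of returning them to the pool."""

    def __init__(self, address, max_idle_secs=300):
        self.address = address            # (host, port) tuple
        self.max_idle_secs = max_idle_secs
        self._idle = []                   # list of (socket, last_used)

    def checkout(self):
        while self._idle:
            sock, last_used = self._idle.pop()
            if time.time() - last_used > self.max_idle_secs:
                sock.close()              # stale: discard, don't reuse
                continue
            return sock
        return socket.create_connection(self.address)

    def checkin(self, sock):
        self._idle.append((sock, time.time()))

    def discard(self, sock):
        """Call this on any socket error: the connection never re-enters
        the pool, so it cannot poison later requests."""
        sock.close()

# Usage: pool = ExpiringPool(("shard-host", 27017)); sock = pool.checkout()
```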
| Comment by Eliot Horowitz (Inactive) [ 02/Apr/13 ] |
For 2.2 and 2.4, there is a new option in SERVER-9022 (releaseConnectionsAfterResponse).
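Assuming this is the releaseConnectionsAfterResponse parameter referenced in later comments, it could be enabled on a running mongos roughly like this (a sketch using pymongo; the host is a placeholder):

```python
from pymongo import MongoClient

# Connect directly to the mongos whose pooling behaviour you want to change.
client = MongoClient("mongos-host", 27017)

# Builds { setParameter: 1, releaseConnectionsAfterResponse: True }.
client.admin.command("setParameter", 1, releaseConnectionsAfterResponse=True)
```

At startup, the equivalent would be passing --setParameter releaseConnectionsAfterResponse=true to mongos, assuming a build that includes the option.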
| Comment by Randolph Tan [ 02/Apr/13 ] |
Hi, we are still working on this. We are currently experimenting with a fix explained in the linked ticket.
| Comment by Geordios Santos [ 02/Apr/13 ] |
Yes, it has been very quiet, and it is still very quiet. Is there any update on this?
| Comment by Eliot Horowitz (Inactive) [ 19/Mar/13 ] |
Matt - yes, much too quiet. There was a lot of internal communication about it, so I didn't realize how little made it out publicly.
| Comment by Matt Bailey [ 19/Mar/13 ] |
@eliot Excellent - really appreciate the update from 10gen; it seemed pretty quiet on this ticket.
| Comment by Eliot Horowitz (Inactive) [ 19/Mar/13 ] |
There are 2 other changes in 2.2.4 that might be related as well in some cases:
| Comment by Eliot Horowitz (Inactive) [ 19/Mar/13 ] |
You definitely don't have to wait for 2.6. We're working with one user on a patch that seems to help a lot in 2.2. I think the problem is that there are a lot of potential causes - some networking, some system - so it's been very hard to get a clear picture of what's happening. We should have a patch for 2.2.4 or 2.2.5 that in most cases should greatly reduce this issue, or at least rule out large buckets of potential issues.
| Comment by Luca Bonmassar [ 18/Mar/13 ] |
We've been advocating Mongo for a long time, and this is the first time a bug has hit us so badly in production, and for such a long time. The answer of disabling sharding or waiting until 2.6 (two major versions away!) is simply a joke. We have built scripts and monitors to detect when this happens and to try to restart the faulty mongo routers (restarting everything feels very Windows 95, btw). We were using Mongo for a high-performance cluster, and we're down basically every other hour. We have been early adopters and supporters while a lot of negative criticism was depicting Mongo as a toy.
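A detect-and-restart monitor of the kind Luca describes might be sketched like this (Python; the host list and probe namespace are assumptions, and routing a real query is deliberate, since a bare ping only tests the client-to-mongos hop):

```python
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, OperationFailure

MONGOS_HOSTS = ["router-1:27017", "router-2:27017"]   # placeholders
PROBE_NS = ("mydb", "mycollection")                   # a sharded namespace

def probe(host):
    """Return False if this mongos surfaces the stale-connection error.

    The SEND_ERROR happens on the mongos->shard hop, so the probe must
    route a real query through to the shards."""
    try:
        client = MongoClient(host, socketTimeoutMS=5000)
        db_name, coll_name = PROBE_NS
        client[db_name][coll_name].find_one()
        return True
    except (AutoReconnect, OperationFailure) as exc:
        # 'socket exception [SEND_ERROR]' surfaces as an OperationFailure.
        print("%s unhealthy: %s" % (host, exc))
        return False

unhealthy = [h for h in MONGOS_HOSTS if not probe(h)]
# A supervisor script could restart the mongos processes in `unhealthy`.
```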
| Comment by noizwaves [ 18/Mar/13 ] |
@DisU seconded. We use Mongo because of its easy clustering. If this doesn't work reliably, it is a real concern for us and the service we provide to our customers!
| Comment by DisU [ 18/Mar/13 ] |
I completely agree that this issue needs to be addressed, even as part of 2.2.4. Retrying five times is not an acceptable workaround. We're considering migrating away from Mongo because of this and similar issues.
| Comment by Matt Bailey [ 18/Mar/13 ] |
Is there anything we can do to mitigate this issue, or are we really going to have to wait for 2.6 for this to be addressed?
| Comment by James Blackburn [ 14/Mar/13 ] |
We've been trialling Mongo here, and this is the bug that hurts us most. It's difficult to work around, too, as it's unclear how many stale connections the mongos has, especially in the morning after a night without use. So we end up retrying 5 times (around every mongo operation), and sometimes that isn't enough.
| Comment by James Blackburn [ 01/Mar/13 ] |
I've written a multi-process batch job. What's interesting is that when this error occurs against the local mongos, it seems to occur consistently for one PID on the local machine at a time. Getting it to happen reliably is difficult, and it randomly fixes itself, but for the duration (on the order of minutes) it appears that the given mongo connection doesn't work.
| Comment by DisU [ 04/Feb/13 ] |
Any news on this? We've been experiencing the same issue for the last few months.
| Comment by Reid Morrison [ 28/Jan/13 ] |
I confirmed that all of the Red Hat Linux servers we have use a TCP keep-alive time of 300.
| Comment by Reid Morrison [ 28/Jan/13 ] |
Yes, we are using the slaveOk read preference. The network connection issues are generally to the replica-set masters, since they are on a different network segment. It is possible that the network connections to the masters are going "stale", since all reads go to the slaves and writes are seldom performed.
| Comment by Reid Morrison [ 28/Jan/13 ] |
We are running 5 shards, each of which is a replica-set. We use the Ruby and Go drivers, so it is not likely to be related to the driver in any way. To reproduce the problem: step 7 may have to be repeated several times, since the connections between mongos and the master sit in a connection pool in mongos, so there is no guarantee that the bad connection will be used on the first attempt. Once mongos is in this state, the only way to fix it is to restart the mongos process; it is not necessary to restart the client. The longer mongos is left running, the greater the number of these bad connections in its connection pool, increasing the likelihood of getting a bad connection when a call is made via mongos.
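Reid's numbered steps are not reproduced above, but the failure mode he describes can be exercised from the client side roughly as follows. A sketch only (Python 2-era, matching the pymongo 2.x API in use in this thread; host and collection names are placeholders), not his exact procedure:

```python
from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongos-host", 27017)  # placeholder mongos
coll = client["mydb"]["mycollection"]

# 1. With the cluster healthy, write once so mongos pools a connection
#    to the shard primary.
coll.insert({"probe": 1})                   # pymongo 2.x insert API

# 2. Interrupt the network between mongos and the primary (e.g. a
#    firewall rule dropping packets), then restore it.
raw_input("break and restore the mongos->primary network, then press enter")

# 3. Write repeatedly: mongos hands out pooled connections more or less
#    arbitrarily, so the dead socket may not be hit on the first attempt.
for attempt in range(20):
    try:
        coll.insert({"probe": attempt})
    except OperationFailure as exc:
        print("attempt %d hit a bad pooled connection: %s" % (attempt, exc))
```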
| Comment by James Blackburn [ 28/Jan/13 ] |
The shards are replica sets. We're using pymongo 2.4.1, and the clients are coming in with a read preference set to primary only. One thing which is (very) non-standard here is that we're LD_PRELOAD'ing libsdp to run TCP/IP over Infiniband. It looks like this doesn't honour the TCP keepalive request - I see this in the mongos log files:
Other than these periodic OperationFailures, Mongo over Infiniband seems to work very well indeed.
| Comment by Randolph Tan [ 28/Jan/13 ] |
reidmorrison & jblackburn: Are the shards replica-sets or standalone servers? If the former, what version of the driver are you using, and are you using slaveOk/read preference? Thanks!
| Comment by Greg Studer [ 28/Jan/13 ] |
Looking more into this. One potential fix/workaround may be keepalive - if it's not set to a low value already, does setting keepalive mitigate the problem? If anyone has a sample of the logs from mongos during the affected period (from the start of mongos, to include startup parameters), ideally at log level 3 (which can be set at runtime), it would be useful to see exactly the timings of what is happening here on a per-connection basis. If the logs contain sensitive info, opening a SUPPORT ticket (which is user-private) linked to this ticket is an option as well.
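Greg's suggestion of log level 3 can be applied at runtime without restarting the mongos. A sketch (pymongo; the host is a placeholder) - the shell equivalent would be db.adminCommand({ setParameter: 1, logLevel: 3 }):

```python
from pymongo import MongoClient

client = MongoClient("mongos-host", 27017)  # placeholder

# Raise verbosity to capture per-connection timings...
client.admin.command("setParameter", 1, logLevel=3)

# ...reproduce the problem, then restore normal verbosity.
client.admin.command("setParameter", 1, logLevel=0)
```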
| Comment by James Blackburn [ 23/Jan/13 ] |
We get a whole load of these, often in the morning, which suggests the sockets are timing out overnight and this is not being noticed. Currently I'm logging all OperationFailures, and they look like:
Retrying seems to resolve the problem (I only see them logged once for a given client). It's a shame that they're not represented as AutoReconnect errors, as it means we have to put AutoReconnect and OperationFailure decorators around every single pymongo call.
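The decorator James mentions might look something like the following (a sketch; retrying on every OperationFailure is a blunt instrument, since that exception class also covers genuine command failures):

```python
import functools
import time

from pymongo.errors import AutoReconnect, OperationFailure

def retry_mongo(retries=5, delay_secs=1.0):
    """Retry a pymongo operation that may hit a stale mongos connection."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(retries):
                try:
                    return func(*args, **kwargs)
                except (AutoReconnect, OperationFailure) as exc:
                    last_exc = exc
                    time.sleep(delay_secs)
            raise last_exc
        return wrapper
    return decorator

@retry_mongo(retries=5)
def fetch_user(coll, user_id):
    # Example wrapped operation; any pymongo call can be decorated.
    return coll.find_one({"_id": user_id})
```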
| Comment by Reid Morrison [ 04/Jan/13 ] |
We are definitely seeing a dramatic difference since upgrading production from V2.0.6 to V2.2.2. It only affects the mongos processes in our second (remote) data center that perform writes back to the shard masters in the first data center. Any time there is a network issue between the 2 data centers, all the mongos processes in the remote data center have to be restarted, due to the send errors being returned to the clients. We run one mongos process on every web and batch-processing server, to reduce network latency and to improve redundancy.
| Comment by Reid Morrison [ 04/Jan/13 ] |
I can confirm that running the command below on the mongo router (mongos) does not help; a full restart of mongos is required: db.adminCommand( { flushRouterConfig : 1 } )
| Comment by Randolph Tan [ 04/Jan/13 ] |
Interesting - the mongos router process has been using pooled connections since before v2.0. We did a fix in
| Comment by Reid Morrison [ 04/Jan/13 ] |
We are running into a similar situation in production with MongoDB V2.2.2. It appears that with V2.2 the behavior of the Mongo router process has changed from V2.0.6. In V2.0, when the router experienced a network failure talking to a remote database server (shard master), it would return a failure message to the client and release the bad/failed connection. As of V2.2, the router process appears to be holding onto these failed connections and keeps re-using the bad connection, resulting in the same error message on subsequent calls from the client to the router process. Can we change the mongo router to close/release these bad connections and not keep them around? Below is the error message we keep getting back from the mongo router process (the IP address is for the master in one of the shards): Restarting the mongo router process fixes the issue - until a network failure occurs again, and we are back to the constant send error above until we manually restart the router process. There is no information in the mongo router process log files about this issue; everything is normal in the log file. The only interesting point is that the number of connections remains steady at 44, but these are probably just the client-side connections, not the connections to the Mongo database servers. Any help with this issue will be greatly appreciated.
| Comment by Joonas Reynders [ 18/Dec/12 ] |
Same problem observed here. Our setup is 2 shards, each shard being a replica set of 3 nodes, 1 mongos, and 1 config server (which runs on the same machine as one of the shards' primaries) - this is a dev setup, btw. When restarting each node one at a time, mongos starts to report this error. To be more exact, sh.status() does not report a problem, but when trying to query anything we get SEND_ERROR (from the same replica-set node that also houses the config server). After redoing the query repeatedly, everything suddenly starts to work again, maybe 10 minutes or so after the node restart cycle. Running mongo 2.2.1.
| Comment by Anton V. Volokhov [ 23/Oct/12 ] |
Same problem here, when trying to access a non-partitioned database through mongos.
| Comment by larry.loi [ 19/Oct/12 ] |
I got the same problem, but the 2 commands below were unable to solve it.
| Comment by Kevin Matulef [ 13/Sep/12 ] |
Hi Enrico, did you get the message above from mongos, or from one of the mongod shards? If it happens again, do you mind uploading the logs from the machine it happens on, and also the logs from mongo07.luan.local and mongo0.luan.local? This message indicates that it's having trouble contacting one of the shards, which could happen if it has a connection that's gone stale for some reason. If this is a mongos, you can force it to refresh its connections by typing:
db.adminCommand( { flushRouterConfig : 1 } )
or if it's a mongod you can type:
db.adminCommand( { connPoolSync : 1 } )