[SERVER-13352] Socket exception (SEND_ERROR) even after SERVER-9022 applied Created: 26/Mar/14 Updated: 12/Jan/15 Resolved: 12/Jan/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Sharding |
| Affects Version/s: | 2.4.9 |
| Fix Version/s: | 2.6.0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Alex Piggott | Assignee: | Ramon Fernandez Marina |
| Resolution: | Done | Votes: | 6 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Issue Status as of Jan 09, 2015

ISSUE SUMMARY
These connections only reveal themselves to be unusable when they are selected from the pool and data is written to them; prior to that they appear to be healthy and usable. This is particularly relevant to large sharded clusters, which contain many connection pools (each mongos process and each primary for a shard has connection pools that can be impacted).

USER IMPACT

WORKAROUNDS
The releaseConnectionsAfterResponse parameter (added in 2.2.4 and 2.4.2 as part of

AFFECTED VERSIONS

FIX VERSION

RESOLUTION DETAILS

Original description
Like some other folks I was encountering the issue described in The occurrences were a bit random but tended to occur in the mornings and early in the week (the latter probably correlated with the weekly compaction that occurs on Saturday night). The problem would always disappear for 1-2 weeks after a mongos restart. After applying I confirmed that all the servers did have the patch applied (was: true) |
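The failure mode in the issue summary (a pooled connection looks healthy until data is written to it) can be sketched outside of mongos. This is a minimal Python illustration, not MongoDB code: `Pool`, `peer_has_closed` and `demo` are hypothetical names, a local `socket.socketpair()` stands in for a mongos-to-mongod connection, and the checkout-time check is only an assumption about what an active connection check might look like.

```python
import select
import socket
from collections import deque


def peer_has_closed(sock):
    """Return True if the remote end has closed (readable, but no bytes to read)."""
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # nothing pending, so the socket still looks healthy
    try:
        # Peek so that any real pending data is left in the buffer.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except OSError:
        return True


class Pool:
    """Hypothetical pool of idle sockets, optionally checked on checkout."""

    def __init__(self, check_on_checkout):
        self.idle = deque()
        self.check_on_checkout = check_on_checkout

    def put(self, sock):
        self.idle.append(sock)

    def get(self):
        while self.idle:
            sock = self.idle.popleft()
            if self.check_on_checkout and peer_has_closed(sock):
                sock.close()  # discard the stale connection, try the next one
                continue
            return sock
        raise LookupError("no usable pooled connection")


def demo(check_on_checkout):
    pool = Pool(check_on_checkout)
    stale, stale_peer = socket.socketpair()
    good, good_peer = socket.socketpair()
    stale_peer.close()  # remote end goes away while the connection sits idle
    pool.put(stale)
    pool.put(good)
    conn = pool.get()
    try:
        conn.sendall(b"ping")  # the first write is where a stale socket fails
        return "ok"
    except OSError:
        return "send error"
    finally:
        for s in (stale, good, good_peer):
            s.close()
```

On a POSIX system, `demo(False)` hands out the stale socket and only fails at write time, while `demo(True)` drops it at checkout and writes on the healthy one.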
| Comments |
| Comment by Alex Piggott [ 09/Jan/15 ] | |||||||||||||
|
(I'm the original reporter for this, we saw them regularly on a large Since moving to mongodb 2.6 we have stopped seeing these problems | |||||||||||||
| Comment by sam flint [ 03/Aug/14 ] | |||||||||||||
|
I am seeing issues as well. This started when I added a fifth mongod to each shard in our cluster. We are currently on 2.4.8 in production. | |||||||||||||
| Comment by Srinivasa Kanamatha [ 29/Jul/14 ] | |||||||||||||
|
We are seeing similar issues in our environment. We are currently on version 2.4.5. Here is the error: "socket exception [SEND_ERROR] for ..." | |||||||||||||
| Comment by Adam Comerford [ 29/Jul/14 ] | |||||||||||||
|
Hi Jérémie, I am the one responsible for writing and recording M202 and I just wanted to clarify the course content regarding connection pooling. I do indeed talk about the need to restart the mongos in certain scenarios, however that section of the course actually states that 2.6 has made it less likely that you need to restart the mongos, so I think there is something of a misunderstanding. When I speak about the need to restart I am referring to versions 2.4 (and below) because there have been specific improvements made in 2.6 around how to handle connections in the pool that have "gone bad". Therefore, the upgrade to 2.6 is definitely a recommended action for people seeing this problem, because (as mentioned in the course) the changes are too complex to backport to 2.4. Adam | |||||||||||||
| Comment by Alex Piggott [ 07/Jul/14 ] | |||||||||||||
|
@Thomas: sorry I haven't had a chance to test it (the change in package structure for 2.6 has messed up our deployment infrastructure, obviously not a big issue to update but does mean I've been waiting for someone to have some free time to do it, vs just being able to update one of our reference instances in 5 minutes) | |||||||||||||
| Comment by Jérémie Charest [ 07/Jul/14 ] | |||||||||||||
|
I took the "M202: MongoDB Advanced Deployment and Operations" course recently on MongoDB University. At some point during the course, one presentation illustrated mongos connection pooling and clearly stated that under mongo 2.6 we have to restart the mongos if that kind of network/socket error occurs (near 7min in the video). https://www.youtube.com/watch?v=v8IGPu8XLKo Will it be fixed in 2.4.x or do we absolutely need to upgrade to 2.6? I think it might be good to mention that information in the mongos documentation. | |||||||||||||
| Comment by Thomas Rueckstiess [ 07/Jul/14 ] | |||||||||||||
|
Hi Alex, Have you had a chance to upgrade your system already? If so, please let us know if the issue reoccurred or if 2.6 fixed the problem for you. Thanks, | |||||||||||||
| Comment by Randolph Tan [ 08/May/14 ] | |||||||||||||
|
I tried following the steps in 2.4.9 and I was able to see the socket exception from the mongos, but it does resolve by itself afterwards so I am not sure if we were seeing the same thing. It is still possible to see the same error in 2.6, but since we actively check the status of the connections, the occurrence of this issue is minimized. | |||||||||||||
| Comment by Alex Piggott [ 08/May/14 ] | |||||||||||||
|
Randolph - thanks for the update. We haven't moved across to 2.6.1 yet, though it's on our TODO list (we're stockpiling empty chunks!). We'll do that in the next couple of weeks and report back 6 weeks later on whether the issues have stopped. Though it seems like Jérémie had possibly found a simple way to reproduce, so checking if that is fixed would be a good start. | |||||||||||||
| Comment by Randolph Tan [ 08/May/14 ] | |||||||||||||
|
Have you tried v2.6.1? It has a fix for | |||||||||||||
| Comment by Jérémie Charest [ 24/Apr/14 ] | |||||||||||||
|
We are experiencing the same behavior on a test shard: 2 repsets (1 primary and 2 slaves on each). Restarting the mongos server "fixes" it. The bug Steps to reproduce:
infos :
mongos logs:
| |||||||||||||
| Comment by Alex Piggott [ 26/Mar/14 ] | |||||||||||||
|
Examples from log file when it occurs:
Interesting, I hadn't noticed the timeouts when the problem occurred before the (EDIT: in case it's causing confusion, note the mongod instances live on port (27017 + shardid) where in this case shardid=1...10, mongos lives on 27017, and the configdb lives on port 27016) EDIT2: note the significance of the timestamps is that there's an hourly test that runs which calls, amongst other things, db.collection.count(), which seems to be the call that fails the most frequently. Note also it's the same (stale?) connection being used in both cases. |
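For reading the host:port values in the log lines, the port layout described in the EDIT above can be written down as a small lookup. This is only a sketch of that stated layout (mongos on 27017, configdb on 27016, mongod on 27017 + shardid for shardid 1 to 10); the function and constant names are hypothetical.

```python
MONGOS_PORT = 27017
CONFIGDB_PORT = 27016
SHARD_IDS = range(1, 11)  # shardid = 1...10, as stated in the comment


def role_for_port(port):
    """Map a port from the logs to the process role in this deployment."""
    if port == MONGOS_PORT:
        return "mongos"
    if port == CONFIGDB_PORT:
        return "configdb"
    shard_id = port - MONGOS_PORT  # mongod lives on 27017 + shardid
    if shard_id in SHARD_IDS:
        return f"mongod shard {shard_id}"
    return "unknown"
```

For example, `role_for_port(27018)` identifies the mongod for shard 1, and `role_for_port(27027)` the mongod for shard 10.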