Type: Bug
Resolution: Done
Priority: Critical - P2
Fix Version/s: 2.6.0
Affects Version/s: 2.4.9
Component/s: Networking, Sharding
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Confidence Status: None
Work Order: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Issue Status as of Jan 09, 2015

ISSUE SUMMARY
Certain network settings and/or events may cause the connection pools used by MongoDB to be populated by "bad" or "broken" connections. Common causes include periodic network failures and firewalls silently killing long-running connections, though the actual cause is sometimes impossible to ascertain.

These connections only reveal themselves to be unusable when they are selected from the pool and data is written to them; until then they appear healthy and usable. This is particularly relevant to large sharded clusters, which contain many connection pools (each mongos process and each shard primary has connection pools that can be affected).

USER IMPACT
When a triggering event occurs, some proportion of the idle connections in a connection pool may become unusable while still appearing healthy. Over time, as the MongoDB process (mongod or mongos) attempts to use these connections from the pool, they may fail, throwing socket exceptions (SEND_ERROR, recv() timeout, etc.). These errors occur sporadically (depending on how many connections were affected and how busy the process is) until the "bad" connections in the pool are exhausted or the process in question is restarted. In practice, this often presents as seemingly random socket exceptions long after the trigger event occurred.
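
The behaviour described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not MongoDB's pooling code: connections are plain dictionaries, the pool is a simple FIFO queue, and the names `simulate_pool` and `broken_ids` are hypothetical. It shows that broken connections are indistinguishable from healthy ones while idle, and that errors stop once every broken connection has been drawn and discarded.

```python
from collections import deque


def simulate_pool(total, broken_ids, requests):
    """Replay `requests` operations against a toy pool and record each outcome.

    Connections are dicts standing in for sockets. A broken connection
    looks exactly like a healthy one until an operation tries to use it.
    """
    pool = deque({"id": i, "broken": i in broken_ids} for i in range(total))
    outcomes = []
    for _ in range(requests):
        conn = pool.popleft()  # take the next idle connection
        if conn["broken"]:
            # The failure only surfaces on use; the connection is discarded
            # and replaced by a fresh, healthy one.
            outcomes.append("socket exception")
            pool.append({"id": conn["id"], "broken": False})
        else:
            outcomes.append("ok")
            pool.append(conn)  # healthy connections return to the pool
    return outcomes


if __name__ == "__main__":
    # Two of five idle connections were silently broken by a trigger event.
    for step, result in enumerate(simulate_pool(5, {1, 3}, 8), start=1):
        print(step, result)
```

In this toy model the number of errors equals the number of broken connections, and how long they take to surface depends on how often the pool is used, which mirrors why a lightly loaded process can keep failing long after the trigger event.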

WORKAROUNDS
If a regular trigger event is suspected, preventing that event in the first place is the best solution. If that proves elusive, the only definitive solution is to restart the impacted processes once such an event has occurred (or is suspected to have occurred) in order to clear out the problematic pools.

The releaseConnectionsAfterResponse parameter (added in 2.2.4 and 2.4.2 as part of SERVER-9022) can help alleviate the issue, but does not eliminate it. This parameter must be used with caution, per the warning given in SERVER-9022.
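
As a rough sketch of how the parameter might be enabled at runtime, the snippet below builds a `setParameter` command document and sends it to the admin database. It assumes a driver client that exposes `client.admin.command(doc)`, as pymongo's `MongoClient` does; the helper names are hypothetical. Whether the parameter can be changed at runtime on a given version should be checked against SERVER-9022 and the documentation for that release.

```python
def build_release_connections_command(enabled=True):
    """Return a setParameter command document for releaseConnectionsAfterResponse."""
    # The command name must be the first key in the document.
    return {"setParameter": 1, "releaseConnectionsAfterResponse": bool(enabled)}


def apply_release_connections(client, enabled=True):
    """Send the command to the admin database of the connected mongos.

    `client` is assumed to be a pymongo.MongoClient, or any object that
    exposes `client.admin.command(doc)`.
    """
    return client.admin.command(build_release_connections_command(enabled))
```

The setting applies only to the process that receives the command, so under this assumption it would have to be sent to each mongos separately.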

AFFECTED VERSIONS
MongoDB versions prior to 2.6.0 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.0 production release.

RESOLUTION DETAILS
MongoDB 2.6 comes with new connection pooling code that includes the work done in SERVER-9041 to proactively detect the re-use of broken connections from the pool.
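
The ticket does not describe how the detection in SERVER-9041 is implemented. As an illustration of the general idea only, the sketch below checks an idle socket before handing it out of a pool, using a non-blocking poll and a peek. The function names `looks_usable` and `checkout` are hypothetical, and this technique is not claimed to be what MongoDB does. It catches connections the peer has closed or reset; it cannot catch a connection that a firewall dropped without sending any packet.

```python
import select
import socket


def looks_usable(sock):
    """Return False if an idle socket has been closed or reset."""
    try:
        readable, _, _ = select.select([sock], [], [], 0)  # poll without blocking
        if not readable:
            return True  # nothing pending, so no sign of a close
        # Peek so that any genuine pending data stays in the receive buffer.
        return sock.recv(1, socket.MSG_PEEK) != b""  # b"" means the peer closed
    except (OSError, ValueError):
        return False  # connection reset, or socket already closed locally


def checkout(pool, connect):
    """Pop idle sockets until one passes the check; otherwise open a new one."""
    while pool:
        sock = pool.pop()
        if looks_usable(sock):
            return sock
        sock.close()  # drop the broken connection instead of reusing it
    return connect()
```

Here `pool` is a plain list of idle sockets and `connect` is a caller-supplied function that opens a new connection when no pooled socket passes the check.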

Original description

Like some other folks, I was encountering the issue described in SERVER-7008 (principally on a cluster with 32 mongos and 20 mongod forming 10 shards, all running 2.4.9).

The occurrences were a bit random but tended to occur in the mornings and early in the week (the latter probably correlated with the weekly compaction that runs on Saturday night).

The problem would always disappear for 1-2 weeks after a mongos restart.

After applying SERVER-9022, the problems appeared to have stopped. After ~6 weeks, however, some nodes started to see SEND_ERROR exceptions. As before, a mongos restart fixed everything.

I confirmed that all the servers did have the patch applied (was: true)

Is related to: SERVER-7008 socket exception [SEND_ERROR] on Mongo Sharding (Closed)

Assignee: Ramon Fernandez Marina
Reporter: Alex Piggott
Participants: Adam Comerford, Alex Piggott, Jérémie Charest, Ramon Fernandez Marina, Randolph Tan, sam flint, Srinivasa Kanamatha, Thomas Rueckstiess
Votes: 6
Watchers: 14

Created: Mar 26 2014 02:22:43 AM UTC
Updated: Jan 12 2015 09:39:24 PM UTC
Resolved: Jan 12 2015 09:23:08 PM UTC
