[SERVER-6028] Too many open connections kills primary but doesn't trigger failover Created: 07/Jun/12  Updated: 08/Jan/24  Resolved: 10/Dec/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Colin Howe Assignee: Benjamin Caimano (Inactive)
Resolution: Won't Fix Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-10982 Replica set may not fail over when pr... Closed
Operating System: ALL
Sprint: Service Arch 2018-11-05, Service Arch 2018-11-19, Service Arch 2018-12-03, Service Arch 2018-12-17
Participants:

 Description   

Late last night we had some issues with mongos (unfortunately it is not clear what went wrong - bouncing it fixed the problem). About an hour later we had a massive spike in the number of connections from mongos to the primary. This caused 'too many open connections' errors to start flooding the primary's logs, and connection attempts throughout our application consistently failed. In effect, our primary was dead.

However, our primary was still telling all the secondaries that it was alive and well, so no failover happened.

I think the health checks need to do more than they do. The primary can't just be "alive"; it must be "alive and well" - i.e. responding to queries and accepting new connections (see the sketch below).
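
A minimal sketch of such a check, assuming Python with pymongo; the function name, timeout value, and probe logic are illustrative, not an existing server mechanism:

    # Hypothetical external probe: a node counts as healthy only if a
    # brand-new connection can be opened and a round-trip command succeeds,
    # not merely if an existing heartbeat channel stays up.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    def primary_is_alive_and_well(host: str, port: int = 27017) -> bool:
        # A fresh client forces a brand-new connection, which is exactly
        # what fails during a 'too many open connections' storm.
        client = MongoClient(host, port, directConnection=True,
                             serverSelectionTimeoutMS=2000)
        try:
            # 'ping' exercises a full round trip on the new socket.
            client.admin.command("ping")
            return True
        except PyMongoError:
            return False
        finally:
            client.close()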



 Comments   
Comment by Mira Carey [ 10/Dec/18 ]

Closing this out due to age, in favor of either SERVER-29237 (max connecting) or PM-1123 (the request backpressure epic).

Comment by Mira Carey [ 10/Dec/18 ]

My thinking here is that, assuming we hit maxConns during a connection storm, there is no reasonable, purely server-side solution.

Taking the scenario mentioned in the ticket, where mongos overwhelms a primary with a burst of connections, it's fairly clear that failover would only replicate the problem on the new primary. For that use case, our best option is either the MaxConnecting sharding task executor parameter (which exists today, and will help if the problem is our inability to re-use connections during a short period), or cooperation between mongos/drivers and mongod in the form of an explicit backpressure protocol (PM-1123). See the sketch below for the first option.
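
A hedged illustration of the first option, assuming Python with pymongo and a hypothetical mongos address. ShardingTaskExecutorPoolMaxConnecting is a real mongos parameter, but whether it can be changed at runtime depends on the server version; otherwise set it at mongos startup with --setParameter:

    # Sketch: cap how many connections a mongos task-executor pool may have
    # in the process of being established at once, so a reconnect storm
    # cannot burst down the primary.
    from pymongo import MongoClient

    mongos = MongoClient("mongodb://mongos.example.net:27017")  # hypothetical host
    mongos.admin.command("setParameter", 1,
                         ShardingTaskExecutorPoolMaxConnecting=2)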

Comment by Andrew Morrow (Inactive) [ 15/Oct/18 ]

ben.caimano - Queueing this up for you for next sprint as DWS.
