[SERVER-6028] Too many open connections kills primary but doesn't trigger failover Created: 07/Jun/12 Updated: 08/Jan/24 Resolved: 10/Dec/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Colin Howe | Assignee: | Benjamin Caimano (Inactive) |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Sprint: | Service Arch 2018-11-05, Service Arch 2018-11-19, Service Arch 2018-12-03, Service Arch 2018-12-17 |
| Participants: | |
| Description |
|
Late last night we had some issues with mongos (unfortunately it is not clear what went wrong; bouncing it fixed the problem). About an hour later we had a massive spike in the number of connections from mongos to the primary. This caused 'too many open connections' to start flooding the primary's logs, and connection attempts throughout our application consistently failed. In effect, our primary was dead. However, the primary was still telling all the secondaries that it was alive and well, so no failover happened. I think the health checks need to do more than they do. The primary can't just be "alive"; it must be "alive and well", i.e. responding to queries and new connections. |
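To make the "alive and well" idea concrete, here is a minimal sketch in Python. It is not how the server's replica set health checks work; `WellnessMonitor`, `can_open_connection`, and the `max_failures` threshold are hypothetical names chosen for illustration. The probe is injected, so it could be anything that exercises the path a real client uses, such as opening a fresh connection. Note that a successful TCP connect alone does not prove the node is answering queries.

```python
import socket
from typing import Callable


def can_open_connection(host: str, port: int, timeout: float = 1.0) -> bool:
    """Hypothetical probe: can a brand-new TCP connection be opened in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class WellnessMonitor:
    """Count consecutive probe failures; a node is 'well' until the limit is hit."""

    def __init__(self, probe: Callable[[], bool], max_failures: int = 3) -> None:
        self.probe = probe
        self.max_failures = max_failures  # assumed threshold, not a server setting
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run the probe once and return True while the node still counts as well."""
        try:
            ok = self.probe()
        except Exception:
            ok = False  # a probe that raises is treated as a failed check
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures < self.max_failures
```

A caller might wrap the probe as `WellnessMonitor(lambda: can_open_connection("db1.example", 27017))`, where the host name is a placeholder, and treat a `False` result from `check()` as a signal that the node should no longer be considered a healthy primary.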
| Comments |
| Comment by Mira Carey [ 10/Dec/18 ] |
|
Closing this out, due to age, in favor of either |
| Comment by Mira Carey [ 10/Dec/18 ] |
|
My thinking here is that, assuming we hit maxConns during a connection storm, there is no reasonable, purely server-side solution. Taking the scenario mentioned in the ticket, where a burst of mongos connections takes down a primary, it's fairly obvious that failover will only reproduce the problem on the new primary. For that use case, our best option is either the MaxConnecting sharding task executor parameter (which exists today, and will work if the problem is our inability to reuse connections during a short period), or cooperation between mongos/drivers and mongod in the form of an explicit backpressure protocol (PM-1123). |
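The idea behind capping how many connections are being established at once can be sketched on the client side as follows. This is an illustration of the general technique only, not the mongos implementation; `ConnectingLimiter`, `open_many`, and `fake_connect` are hypothetical names, and the sleep stands in for a real connection handshake.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator


class ConnectingLimiter:
    """Allow at most `max_connecting` connection setups to be in flight at once."""

    def __init__(self, max_connecting: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connecting)
        self._lock = threading.Lock()
        self.in_flight = 0  # setups currently in progress
        self.peak = 0       # highest in_flight value observed so far

    @contextmanager
    def connecting(self) -> Iterator[None]:
        self._slots.acquire()  # blocks while the cap is already reached
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        try:
            yield
        finally:
            with self._lock:
                self.in_flight -= 1
            self._slots.release()


def fake_connect() -> None:
    """Stand-in for a real handshake (assumption: it takes about 10 ms)."""
    time.sleep(0.01)


def open_many(limiter: ConnectingLimiter, connect: Callable[[], None], count: int) -> None:
    """Start `count` connection attempts concurrently, each gated by the limiter."""
    def worker() -> None:
        with limiter.connecting():
            connect()

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With `limiter = ConnectingLimiter(2)` and `open_many(limiter, fake_connect, 10)`, all ten attempts complete, but no more than two are ever establishing at the same moment, so a burst is spread out instead of hitting the server all at once.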
| Comment by Andrew Morrow (Inactive) [ 15/Oct/18 ] |
|
ben.caimano - Queueing this up for you for next sprint as DWS. |