[DRIVERS-383] Enable and configure TCP Keepalive by default Created: 18/May/17 Updated: 12/May/23 Resolved: 10/Jun/20 |
|
| Status: | Closed |
| Project: | Drivers |
| Component/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Roy Rim | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 5 |
| Labels: | newdriver | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Case: | (copied to CRM) | ||
| Driver Compliance: |
|
| Description |
Problem Description
Keepalive in the Java driver (and other drivers) is disabled by default. As a result, a connection to a server that has gone down can stay stuck in a waiting state in the middle of a socket read. We had a situation where a mongos server crashed, leaving 100 open connections on the client side. When we recovered the mongos, the Java driver still had 100 bad connections checked out of the pool and would not open new ones.
As part of this change, drivers should include in their documentation a link to the keepalive section of the MongoDB Diagnostics FAQ.
Specification
|
| Comments |
| Comment by Bernie Hackett [ 01/May/19 ] |
|
charles.sarrazin (and anyone else running into issues with Azure), you can change keepalive at the OS level. See https://support.esri.com/en/technical-article/000006285. Note that the description above says "A driver SHOULD set <setting> to <value> unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it." So a driver won't override a smaller value already set at the OS level. |
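The quoted rule can be read as a small decision function. The following is a minimal sketch in Python, not any driver's actual implementation. The helper name `keepalive_value_to_set` is hypothetical, and the 300-second target in the usage lines is only an illustrative assumption (it is the value this ticket attributes to the server docs).

```python
from typing import Optional


def keepalive_value_to_set(system_default: Optional[int], target: int) -> Optional[int]:
    """Return the value a driver would apply, or None to leave the setting alone.

    Hypothetical helper: system_default is None when the default could not be read.
    """
    if system_default is None:
        # Default unknown: do not attempt to change it.
        return None
    if system_default < target:
        # The OS is already configured with a smaller value: keep it.
        return None
    return target


# Illustrative values only (seconds).
print(keepalive_value_to_set(7200, 300))  # 300
print(keepalive_value_to_set(120, 300))   # None
print(keepalive_value_to_set(None, 300))  # None
```

With this reading, an operator who has already lowered keepalive at the OS level (the 120 case) is never overridden by the driver.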
| Comment by Charles Sarrazin (Inactive) [ 21/Mar/18 ] |
|
Wouldn't 120 be a saner keepalive time? I ask because we recommend a keepalive time of 120 for Azure deployments in the production notes (the Azure load balancer kills connections after 240 seconds). I have also seen issues for customers using the default keepalive on Azure with most tools. One example is mongorestore with index creation: the connection is dropped transparently and the tool stalls, no longer receiving responses to the queries it issued to the server (index creation, as well as pending batches of inserts). |
| Comment by Jeremy Mikola [ 30/Aug/17 ] |
|
sgupta@vertmarkets.com: Individual drivers are validated when their dependent ticket is resolved. In C#'s case, that ticket is |
| Comment by Swapna Gupta [ 30/Aug/17 ] |
|
I noticed that the C# driver is not included in the validation section. This is an issue for the C# driver as well. |
| Comment by Bernie Hackett [ 15/Aug/17 ] |
|
Note that some Operating Systems don't implement getsockopt for options like TCP_KEEPIDLE (though you can set those options): https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52486 Also, macOS doesn't define TCP_KEEPIDLE. It defines TCP_KEEPALIVE (not to be confused with SO_KEEPALIVE!) for the same purpose. |
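To illustrate the two portability problems noted here, the sketch below (Python, assumptions labeled) looks up whichever idle-time constant the platform exposes and treats a failing getsockopt as "default unknown". The helper names `idle_option` and `read_idle_seconds` are hypothetical. Which constants exist depends on the OS and on the Python build, which is why they are looked up with `getattr` rather than referenced directly.

```python
import socket
from typing import Optional


def idle_option() -> Optional[int]:
    """Return the platform's keepalive idle-time option, or None if none is exposed."""
    # TCP_KEEPIDLE is tried first; macOS names the same setting TCP_KEEPALIVE.
    for name in ("TCP_KEEPIDLE", "TCP_KEEPALIVE"):
        opt = getattr(socket, name, None)
        if opt is not None:
            return opt
    return None


def read_idle_seconds(sock: socket.socket) -> Optional[int]:
    """Read the socket's current idle time, or None if it cannot be determined."""
    opt = idle_option()
    if opt is None:
        return None
    try:
        return sock.getsockopt(socket.IPPROTO_TCP, opt)
    except OSError:
        # Some systems accept setsockopt for this option but not getsockopt.
        return None


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(read_idle_seconds(s))
```

Returning None in both failure cases lets the caller fall back to "do not change the setting" without caring why the value was unavailable.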
| Comment by Bernie Hackett [ 24/May/17 ] |
|
That's a good point. I think we need a point 5 about documentation. |
| Comment by Jeffrey Yemin [ 24/May/17 ] |
|
I can see warning if the equivalent of getsockopt/setsockopt actually fails. I'm less sure about warning for drivers in languages that don't even have the ability to call such methods at all (e.g. Java), as the warning would be logged in all circumstances. |
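One way to picture this distinction is sketched below in Python, purely as an illustration and not as the behaviour of any driver. The function name `configure_keepalive` and the logger name are hypothetical. The sketch stays silent when the platform exposes no idle-time option at all and logs a warning only when an actual setsockopt call fails.

```python
import logging
import socket

log = logging.getLogger("keepalive_sketch")  # hypothetical logger name


def configure_keepalive(sock: socket.socket, idle_seconds: int) -> bool:
    """Try to enable keepalive and set the idle time; return True on success."""
    opt = getattr(socket, "TCP_KEEPIDLE", None)
    if opt is None:
        # macOS uses TCP_KEEPALIVE for the same purpose.
        opt = getattr(socket, "TCP_KEEPALIVE", None)
    if opt is None:
        # No way to configure idle time here: stay silent, since a warning
        # would otherwise be logged in all circumstances.
        return False
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.setsockopt(socket.IPPROTO_TCP, opt, idle_seconds)
    except OSError as exc:
        # The call exists but failed: this is the case worth warning about.
        log.warning("could not configure TCP keepalive: %s", exc)
        return False
    return True


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(configure_keepalive(s, 120))
```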
| Comment by Bernie Hackett [ 24/May/17 ] |
|
Should the driver warn if it can't configure idle time or interval? |
| Comment by Jeffrey Yemin [ 24/May/17 ] |
|
Proposed specification:
|
| Comment by Bernie Hackett [ 23/May/17 ] |
|
We need solid documentation to go along with this. We should clearly document in the docs for each driver that idle time must be set appropriately at the OS level for both client and server, and recommend an appropriate value (the server docs recommend 300 seconds). |