[DRIVERS-383] Enable and configure TCP Keepalive by default Created: 18/May/17  Updated: 12/May/23  Resolved: 10/Jun/20

Status: Closed
Project: Drivers
Component/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Roy Rim Assignee: Unassigned
Resolution: Done Votes: 5
Labels: newdriver
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on GODRIVER-37 Set TCP keep alive by default Closed
depends on RUBY-1283 Enable and configure TCP Keepalive by... Closed
depends on CDRIVER-2176 Enable and configure TCP Keepalive by... Closed
depends on CSHARP-1994 Enable and configure TCP Keepalive by... Closed
depends on CXX-1363 Have TCP keepalive default to true Closed
depends on JAVA-2531 Have TCP keepalive default to true Closed
depends on NODE-1024 Have TCP keepalive default to true Closed
depends on PHPC-969 Have TCP keepalive default to true Closed
depends on PYTHON-1279 Have TCP keepalive default to true Closed
depends on RUST-170 Enable and configure TCP Keepalive by... Closed
Related
related to RUBY-1211 Add Mongo::TCPSocket Keep-Alive Confi... Closed
related to SERVER-29341 Set TCP_KEEPIDLE and TCP_KEEPINTVL (o... Closed
related to GODRIVER-2846 Make expected TCP KeepAlive behavior ... Closed
is related to RUBY-1799 Connection options for tcp_keepalive_... Closed
Case:
Driver Compliance:
Key           Status/Resolution   FixVersion
NODE-1024     Done                3.0.0
PYTHON-1279   Fixed               3.5
SCALA-312     Works as Designed   2.2.0
JAVA-2531     Fixed               3.5.0
CSHARP-1994   Fixed               2.7.1
CXX-1363      Done
PHPC-969      Done                1.4.0-beta1, 1.4.0
CDRIVER-2176  Fixed               1.8.0
PERL-780      Fixed               2.1.0
GODRIVER-37   Fixed               0.0.1
RUBY-1283     Fixed               2.5.1
RUST-170      Fixed               1.1.0
SWIFT-485     Works as Designed

 Description   
Problem Description

TCP keepalive is disabled by default in the Java driver (and in other drivers). As a result, a connection to a downed server can remain stuck in a waiting state in the middle of a socket read.

We had a situation where a mongos server crashed, leaving 100 open connections on the client side. When we recovered the mongos, the Java driver still had 100 bad connections checked out from the pool and would not open new ones.

As part of this change, drivers should include in their documentation a link to the keepalive section of the MongoDB Diagnostics FAQ.

Specification
  1. A driver MUST enable TCP keepalive by default. This matches the behavior of the MongoDB server.
  2. A driver MUST deprecate TCP keepalive-related options in the connection string (and any other way that it is configured), as there is no demonstrated benefit to allowing it to be disabled. This also matches the behavior of the server.
  3. A driver SHOULD set tcp_keepalive_time to 300 seconds unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it. This matches the behavior of the server as well.
  4. A driver SHOULD set tcp_keepalive_intvl to 10 seconds unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it. This is not the current behavior of the server, but if accepted here it will be recommended. The reasoning is that with the default of 75 seconds for this value and a default of 9 probes, the actual time to failure is 300 + (75 * 9) = 975 sec = 16.25 minutes. With a 10 second interval between probes it becomes a more reasonable 6.5 minutes.
  5. A driver SHOULD set tcp_keepalive_cnt to 9 probes unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it.
  6. A driver MUST document how keepalive-related options are configured. Drivers that can set tcp_keepalive_time and tcp_keepalive_intvl to the values mandated above MUST document that they do so. Drivers that cannot MUST document that they do not and link to the appropriate MongoDB Diagnostics FAQ keepalive section for instructions on setting these values at the system level.
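
The rules in points 1, 3, 4 and 5 can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used by any driver: the helper names `lower_if_needed` and `configure_keepalive` are hypothetical, and the option names `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are assumed to be the Linux spellings, so they are looked up defensively because not every platform defines them.

```python
import socket

# Targets from the specification: (socket option name, maximum value).
KEEPALIVE_TARGETS = [
    ("TCP_KEEPIDLE", 300),   # tcp_keepalive_time, in seconds
    ("TCP_KEEPINTVL", 10),   # tcp_keepalive_intvl, in seconds
    ("TCP_KEEPCNT", 9),      # tcp_keepalive_cnt, in probes
]


def lower_if_needed(sock, option, max_value):
    """Lower a TCP option to max_value only if the current value is larger.

    Returns the value believed to be in effect afterwards, or None when the
    current value cannot be determined (in which case nothing is changed).
    """
    try:
        current = sock.getsockopt(socket.IPPROTO_TCP, option)
    except OSError:
        return None  # cannot determine the default: leave it alone
    if current <= max_value:
        return current  # the default is already small enough
    try:
        sock.setsockopt(socket.IPPROTO_TCP, option, max_value)
    except OSError:
        return current  # setting failed: the old value is still in effect
    return max_value


def configure_keepalive(sock):
    """Hypothetical helper: enable keepalive, then apply the targets."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    for name, max_value in KEEPALIVE_TARGETS:
        option = getattr(socket, name, None)  # missing on some platforms
        if option is None:
            applied[name] = None
        else:
            applied[name] = lower_if_needed(sock, option, max_value)
    return applied
```

Calling `configure_keepalive(sock)` on a freshly created TCP socket returns a dictionary of the values in effect, with `None` for any option that could not be read or is not defined on the platform.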


 Comments   
Comment by Bernie Hackett [ 01/May/19 ]

charles.sarrazin (and anyone else running into issues with Azure), you can change keepalive at the OS level. See https://support.esri.com/en/technical-article/000006285. Note that the description above says "A driver SHOULD set <setting> to <value> unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it." So a driver won't override a smaller value already set at the OS level.
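
To check what the OS-level values currently are, something like the following sketch can be used. It assumes a Linux host, where the system-wide defaults are exposed as files under `/proc/sys/net/ipv4` (note that the Linux sysctl for the probe count is named `tcp_keepalive_probes`). The function name `system_keepalive_defaults` is hypothetical, and other operating systems expose these settings differently.

```python
from pathlib import Path

# Linux-only assumption: system-wide keepalive defaults live here.
PROC_DIR = Path("/proc/sys/net/ipv4")
SYSCTL_NAMES = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def system_keepalive_defaults(proc_dir=PROC_DIR):
    """Return the system keepalive defaults, with None for unreadable ones."""
    defaults = {}
    for name in SYSCTL_NAMES:
        try:
            defaults[name] = int((proc_dir / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            defaults[name] = None
    return defaults
```

A value above the targets in the description (300 seconds idle time, 10 second interval) suggests the system default is one that a conforming driver would try to lower on its own sockets.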

Comment by Charles Sarrazin (Inactive) [ 21/Mar/18 ]

Wouldn't 120 be a saner value for keepalive time? I'm asking because we recommend a keepalive time of 120 for Azure deployments in the production notes (the Azure load balancer kills connections after 240 seconds). I have also seen issues for customers using the default keepalive on Azure with most tools. One example is mongorestore with index creation: the connection is dropped transparently and the tool stalls, no longer receiving responses to the operations issued to the server (index creation, as well as pending batches of inserts).

Comment by Jeremy Mikola [ 30/Aug/17 ]

sgupta@vertmarkets.com: Individual drivers are validated when their dependent ticket is resolved. In C#'s case, that ticket is CSHARP-1994.

Comment by Swapna Gupta [ 30/Aug/17 ]

I noticed that the C# driver is not included in the validation section. This is an issue for the C# driver as well.

Comment by Bernie Hackett [ 15/Aug/17 ]

Note that some Operating Systems don't implement getsockopt for options like TCP_KEEPIDLE (though you can set those options):

https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52486

Also, macOS doesn't define TCP_KEEPIDLE. It defines TCP_KEEPALIVE (not to be confused with SO_KEEPALIVE!) for the same purpose.
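
A driver written in Python might handle both caveats along the lines of the sketch below. The helper names `idle_option` and `read_idle_seconds` are hypothetical. The sketch assumes that the `socket` module exposes `TCP_KEEPIDLE` where the platform defines it, and `TCP_KEEPALIVE` on macOS in recent Python versions; both are looked up with `getattr` so that a missing constant is treated as "not available" rather than raising.

```python
import socket
import sys


def idle_option():
    """Return the option number that controls keepalive idle time, or None."""
    # Linux and several other systems call this option TCP_KEEPIDLE.
    option = getattr(socket, "TCP_KEEPIDLE", None)
    if option is None and sys.platform == "darwin":
        # macOS uses TCP_KEEPALIVE for the same purpose. This is a TCP-level
        # option and is unrelated to the SO_KEEPALIVE on/off switch.
        option = getattr(socket, "TCP_KEEPALIVE", None)
    return option


def read_idle_seconds(sock):
    """Return the socket's current idle time, or None if it cannot be read."""
    option = idle_option()
    if option is None:
        return None
    try:
        return sock.getsockopt(socket.IPPROTO_TCP, option)
    except OSError:
        # Some systems accept setsockopt for this option but not getsockopt.
        return None
```

When `read_idle_seconds` returns `None`, the rule quoted in the description applies: the driver cannot determine the system default, so it should not attempt to change it.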

Comment by Bernie Hackett [ 24/May/17 ]

That's a good point. I think we need a point 5 about documentation.

Comment by Jeffrey Yemin [ 24/May/17 ]

I can see warning if the equivalent of getsockopt/setsockopt actually fails. I'm less sure about warning for drivers in languages that don't even have the ability to call such methods at all (e.g. Java), as the warning would be logged in all circumstances.

Comment by Bernie Hackett [ 24/May/17 ]

Should the driver warn if it can't configure idle time or interval?

Comment by Jeffrey Yemin [ 24/May/17 ]

Proposed specification:

  1. A driver MUST enable TCP keepalive by default. This matches the behavior of the MongoDB server.
  2. A driver MUST deprecate TCP keepalive-related options in the connection string (and any other way that it is configured), as there is no demonstrated benefit to allowing it to be disabled. This also matches the behavior of the server.
  3. A driver SHOULD set tcp_keepalive_time to 300 seconds unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it. This matches the behavior of the server as well.
  4. A driver SHOULD set tcp_keepalive_intvl to 10 seconds unless it determines that the system default is already less than that. If the driver is unable to determine the system default at all it should not attempt to change it. This is not the current behavior of the server, but if accepted here it will be recommended. The reasoning is that with the default of 75 seconds for this value and a default of 9 probes, the actual time to failure is 300 + (75 * 9) = 975 sec = 16.25 minutes. With a 10 second interval between probes it becomes a more reasonable 6.5 minutes.
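
The arithmetic in point 4 can be checked with a short sketch. The function name `seconds_to_failure` is made up for illustration; it simply adds the idle time to the time spent on unanswered probes.

```python
def seconds_to_failure(idle_seconds, interval_seconds, probes):
    """Time until a dead peer is detected: idle time plus all probe intervals."""
    return idle_seconds + interval_seconds * probes


# 300 second idle time with the default 75 second interval and 9 probes.
default_interval = seconds_to_failure(300, 75, 9)   # 975 seconds
# Same idle time and probe count with the proposed 10 second interval.
proposed_interval = seconds_to_failure(300, 10, 9)  # 390 seconds

print(default_interval / 60)   # 16.25 minutes
print(proposed_interval / 60)  # 6.5 minutes
```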

Comment by Bernie Hackett [ 23/May/17 ]

We need solid documentation to go along with this. We should clearly document in the docs for each driver that idle time must be set appropriately at the OS level for both client and server, and recommend an appropriate value (the server docs recommend 300 seconds).

Generated at Thu Feb 08 08:21:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.