[SERVER-57468] Enable TCP_USER_TIMEOUT by default Created: 04/Jun/21  Updated: 16/Feb/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Shane Harvey Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
is duplicated by SERVER-69063 Fix TCP keepalive option setting Closed
Issue split
split to SERVER-69175 Add transport::SocketOption template ... Closed
Related
related to PYTHON-3035 Query to mongodb stuck if the cluster... Closed
Assigned Teams:
Service Arch
Backport Requested:
v6.1, v6.0, v5.0, v4.4, v4.2
Sprint: Service Arch 2022-12-26, Service Arch 2022-08-22, Service Arch 2022-09-05, Service Arch 2022-09-19, Service Arch 2022-10-31, Service Arch 2022-11-14, Service Arch 2022-11-28, Service Arch 2022-12-12, Service Arch 2023-01-09, Service Arch 2023-01-23, Service Arch 2023-02-06
Participants:
Case:

 Description   

The server should consider enabling TCP_USER_TIMEOUT for the same reasons described in DRIVERS-1692. This would fix a problem where an operation on an unresponsive connection can block for ~16 minutes (the timeout implied by the default net.ipv4.tcp_retries2 value) instead of ~5 minutes (the server's default TCP keepalive period).

If the server does not do this automatically, admins can control this timeout behavior through the net.ipv4.tcp_retries2 setting.

$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection,
when RTO retransmissions remain unacknowledged.
Given a value of N, a hypothetical TCP connection following
exponential backoff with an initial RTO of TCP_RTO_MIN would
retransmit N times before killing the connection at the (N+1)th RTO.

The default value of 15 yields a hypothetical timeout of 924.6
seconds and is a lower bound for the effective timeout.
TCP will effectively time out at the first RTO which exceeds the
hypothetical timeout.

RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
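
As a worked check of the figures quoted above, here is a minimal sketch that reproduces the 924.6-second and "at least 8" numbers. It assumes the Linux constants TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s, which are not stated in this ticket, and the function name is made up for illustration.

```python
# Hypothetical timeout implied by net.ipv4.tcp_retries2, following the
# kernel documentation quoted above.
# Assumed Linux constants (not taken from this ticket):
TCP_RTO_MIN_MS = 200       # initial RTO of the "hypothetical" connection
TCP_RTO_MAX_MS = 120_000   # upper bound applied to every RTO


def hypothetical_timeout_ms(retries2: int) -> int:
    """Sum the first (retries2 + 1) RTOs under capped exponential backoff."""
    return sum(min(TCP_RTO_MIN_MS << i, TCP_RTO_MAX_MS) for i in range(retries2 + 1))


for n in (7, 8, 15):
    print(n, hypothetical_timeout_ms(n) / 1000)
# 7 51.0
# 8 102.2
# 15 924.6
```

With these assumptions, 8 is the smallest value that clears the 100 seconds recommended by RFC 1122, and the default of 15 gives 924.6 seconds, roughly 15.4 minutes.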



 Comments   
Comment by Shane Harvey [ 19/Aug/22 ]

Agreed a non-default option SGTM.

Comment by Billy Donahue [ 19/Aug/22 ]

shane.harvey@mongodb.com I think we need to implement the option so we have it in our back pocket.
The documentation for it should definitely give the proper caveats about what it does.

Unlike tcp_retries2, it's a per-connection setting, so that's a big advantage.

Maybe we can LOG a warning if TCP_USER_TIMEOUT is set to a value that's incompatible with the 3 TCP_KEEP* values.

I'm also not thrilled that our TCP keepalive knobs are all but hardcoded.
We give them these values by default and there's no caller that gives non-default values.

https://github.com/10gen/mongo/blob/aadd70eef054d7a0fdb557b557c1cb7c108c52a5/src/mongo/util/net/socket_utils.h#L39-L40

inline constexpr Seconds kMaxKeepIdleSecs{300};
inline constexpr Seconds kMaxKeepIntvlSecs{1};

TCP_KEEPCNT is missing altogether and we don't ever adjust it. We should add that capability, too.

It seems like these will need to be adjustable in exactly the same way as TCP_USER_TIMEOUT for this to make sense as a holistic product feature.

Comment by Shane Harvey [ 19/Aug/22 ]

When implementing this feature we should be careful not to unintentionally increase the timeout for users that are already setting tcp_retries2 at the OS level. For example, it would not be ideal to unconditionally set TCP_USER_TIMEOUT, because it overrides tcp_retries2 and the user would end up with a longer retry period than they wanted.
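
One possible shape for that guard, as a rough sketch only: compare the proposed TCP_USER_TIMEOUT against the timeout implied by the host's tcp_retries2 and skip the socket option when it would lengthen the retry period. All function names here are hypothetical, the RTO constants are assumed Linux defaults, and falling back to the proposed value when the sysctl cannot be read is an arbitrary policy choice, not something decided in this ticket.

```python
from typing import Optional

# Assumed Linux constants (not taken from this ticket).
TCP_RTO_MIN_MS = 200
TCP_RTO_MAX_MS = 120_000


def retries2_timeout_ms(retries2: int) -> int:
    """Hypothetical timeout for a given tcp_retries2 (capped exponential backoff)."""
    return sum(min(TCP_RTO_MIN_MS << i, TCP_RTO_MAX_MS) for i in range(retries2 + 1))


def read_tcp_retries2(path: str = "/proc/sys/net/ipv4/tcp_retries2") -> Optional[int]:
    """Return the host's tcp_retries2, or None if it cannot be read (e.g. not Linux)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def choose_user_timeout_ms(proposed_ms: int, retries2: Optional[int]) -> Optional[int]:
    """Return the value to set, or None to leave TCP_USER_TIMEOUT unset.

    None means "let tcp_retries2 keep governing", which avoids giving an
    admin a longer timeout than the one they configured at the OS level.
    """
    if retries2 is None:
        return proposed_ms  # assumption: no OS-level information, use the proposal
    if proposed_ms > retries2_timeout_ms(retries2):
        return None
    return proposed_ms


print(choose_user_timeout_ms(300_000, 5))   # admin lowered tcp_retries2 to 5
print(choose_user_timeout_ms(300_000, 15))  # kernel default
```

In this sketch a host tuned to tcp_retries2 = 5 (about 12.6 seconds) is left alone, while a host on the default of 15 gets the proposed 300-second value.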

It could be simpler to implement DRIVERS-1707 in mongos instead. The main idea in DRIVERS-1707 is to cancel in-flight operations when an SDAM heartbeat fails with a network timeout. One caveat is that DRIVERS-1707 would only handle cluster connections on the mongos side, not intra-replica-set connections (e.g. agg $out on a secondary).

Another important note is that TCP_USER_TIMEOUT overrides TCP_KEEPCNT for keepalive, which is why the recommendation is to set TCP_USER_TIMEOUT to slightly less than TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.
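
To make that formula concrete, here is a minimal sketch that computes the recommended value and applies it to a socket. The idle and interval values are the server constants quoted earlier in this thread; the probe count of 9 (the usual Linux default for tcp_keepalive_probes) and the 1-second margin standing in for "slightly less" are assumptions, and the helper names are hypothetical. TCP_USER_TIMEOUT is expressed in milliseconds and is Linux-specific, so each option is applied only if the platform exposes it.

```python
import socket

KEEP_IDLE_S = 300   # kMaxKeepIdleSecs
KEEP_INTVL_S = 1    # kMaxKeepIntvlSecs
KEEP_CNT = 9        # assumption: Linux default probe count


def recommended_user_timeout_ms(idle_s: int, intvl_s: int, cnt: int,
                                margin_ms: int = 1000) -> int:
    """TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT, minus a small margin, in ms."""
    return (idle_s + intvl_s * cnt) * 1000 - margin_ms


def configure(sock: socket.socket) -> int:
    """Enable keepalive and set the timeout options that this platform supports."""
    user_timeout_ms = recommended_user_timeout_ms(KEEP_IDLE_S, KEEP_INTVL_S, KEEP_CNT)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    options = (
        ("TCP_KEEPIDLE", KEEP_IDLE_S),
        ("TCP_KEEPINTVL", KEEP_INTVL_S),
        ("TCP_KEEPCNT", KEEP_CNT),
        ("TCP_USER_TIMEOUT", user_timeout_ms),  # milliseconds
    )
    for name, value in options:
        opt = getattr(socket, name, None)  # missing on some platforms
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return user_timeout_ms


print(recommended_user_timeout_ms(KEEP_IDLE_S, KEEP_INTVL_S, KEEP_CNT))  # 308000
```

A caller would run `configure(sock)` on a TCP socket before connecting. With these numbers keepalive would give up after about 309 seconds, so the sketch sets TCP_USER_TIMEOUT to 308 seconds.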

Comment by Billy Donahue [ 18/Aug/22 ]

sneak peek at this.
https://github.com/10gen/mongo/pull/7017/files
Still needs a test. I can see that the option is being set to the desired value, so the plumbing works.
I'm just not clear on how it works yet, so it will take a little research to write a test for the setting's actual behavior.
