Type: Improvement
Resolution: Unresolved
Priority: Unknown
Component/s: CSOT
Summary
With CSOT, drivers set the socket read timeout to "remaining timeout" and set maxTimeMS to "remaining timeout - minRTT" so that the server has time to respond with a MaxTimeMSExpired error before the socket timeout fires. In practice, however, network and server latency vary, so the client can hit the socket timeout before it reads the MaxTimeMSExpired error. This is not ideal because drivers close the connection after a socket timeout, which leads to connection churn.
It would be better to avoid connection churn when operations time out. One way to accomplish this is to:
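To make the arithmetic concrete, here is a minimal sketch of how the two values relate for a single attempt. The function name `compute_timeouts` and its parameters are hypothetical, not a driver API, and raising when the remaining time does not cover one round trip is an assumption made for this sketch.

```python
from typing import Tuple


def compute_timeouts(remaining_ms: float, min_rtt_ms: float) -> Tuple[float, int]:
    """Return (socket_read_timeout_seconds, max_time_ms) for one attempt.

    Hypothetical helper: the socket timeout is the full remaining timeout,
    while maxTimeMS is reduced by minRTT so the server's MaxTimeMSExpired
    reply has a chance to arrive before the socket timeout fires.
    """
    max_time_ms = int(remaining_ms - min_rtt_ms)
    if max_time_ms <= 0:
        # Assumption for this sketch: fail fast instead of sending a command
        # that cannot complete within the remaining time.
        raise TimeoutError("remaining timeout does not cover one round trip")
    return remaining_ms / 1000.0, max_time_ms


if __name__ == "__main__":
    # 500 ms remaining with a 20 ms minimum RTT.
    print(compute_timeouts(500, 20))
```

The margin between the two values is only minRTT, so any latency above the minimum on the return path can make the socket timeout win the race.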
- keep the connection open after a socket timeout
- mark the connection with a pending read
- check the connection back into the pool
Then some subsequent operation will:
- check out the connection from the pool
- check whether the connection has a pending read and, if so, complete the read.
- continue running the next operation normally.
This design avoids the connection churn. It also implicitly enforces some backpressure, because the next operation is not sent to the server until the pending operation completes.
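The steps above can be sketched roughly as follows. This is an illustrative toy, not the linked proof of concept: `Connection`, `Pool`, `run_command`, and `demo` are hypothetical names, the framing is a simplified 4-byte little-endian length prefix, and a local `socket.socketpair()` stands in for the server. Reusing the next operation's timeout for the pending read, and checking the connection back in if that read times out again, are assumptions of this sketch.

```python
import socket
import struct
from collections import deque


class Connection:
    """Toy connection that survives a socket timeout (hypothetical)."""

    def __init__(self, sock: socket.socket) -> None:
        self.sock = sock
        self.pending_read = False
        self._buf = bytearray()  # partial reply bytes are kept across timeouts

    def send(self, payload: bytes) -> None:
        self.sock.sendall(struct.pack("<i", len(payload) + 4) + payload)

    def read_reply(self, timeout_s: float) -> bytes:
        self.sock.settimeout(timeout_s)
        try:
            while True:
                if len(self._buf) >= 4:
                    (total,) = struct.unpack_from("<i", self._buf)
                    if len(self._buf) >= total:
                        reply = bytes(self._buf[4:total])
                        del self._buf[:total]
                        self.pending_read = False
                        return reply
                chunk = self.sock.recv(4096)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                self._buf += chunk
        except socket.timeout:
            # Keep the connection open and mark it instead of closing it.
            self.pending_read = True
            raise


class Pool:
    def __init__(self) -> None:
        self._idle = deque()

    def check_in(self, conn: Connection) -> None:
        self._idle.append(conn)

    def check_out(self, timeout_s: float) -> Connection:
        conn = self._idle.popleft()
        if conn.pending_read:
            try:
                conn.read_reply(timeout_s)  # complete and discard the stale reply
            except socket.timeout:
                self.check_in(conn)  # assumption: leave it pending for a later attempt
                raise
        return conn


def run_command(pool: Pool, payload: bytes, timeout_s: float) -> bytes:
    conn = pool.check_out(timeout_s)
    try:
        conn.send(payload)
        return conn.read_reply(timeout_s)
    finally:
        pool.check_in(conn)  # checked in even after a socket timeout


def demo() -> bytes:
    client_sock, server_sock = socket.socketpair()
    pool = Pool()
    pool.check_in(Connection(client_sock))
    server = Connection(server_sock)  # fake server end, reuses the framing
    try:
        try:
            run_command(pool, b"slow", 0.05)  # no reply yet: socket timeout
        except socket.timeout:
            pass
        server.send(b"late reply")  # reply to the timed-out operation
        server.send(b"fast reply")  # reply to the next operation
        return run_command(pool, b"fast", 1.0)
    finally:
        client_sock.close()
        server_sock.close()


if __name__ == "__main__":
    print(demo())
```

In this sketch the second `run_command` drains the late reply during checkout before it sends anything, which is where the backpressure comes from.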
A proof of concept in PyMongo is implemented here: https://github.com/ShaneHarvey/mongo-python-driver/commit/d6e4c877d1972a1cba85673eb91a0c1dfcd185a9
Motivation
An example of this poor behavior is described in HELP-56519, where a latency spike on the server caused many socket timeouts, the resulting churn triggered a connection storm, and the storm added even more load to the server.
Who is the affected end user?
All users of CSOT.
How does this affect the end user?
Connection churn and poor performance, because connections need to be reopened.
How likely is it that this problem or use case will occur?
Likely.
If the problem does occur, what are the consequences and how severe are they?
The resulting connection storm can overwhelm nodes in the cluster.
Is this issue urgent?
TBD.
Is this ticket required by a downstream team?
TBD.
Is this ticket only for tests?
Both spec changes and tests.
Acceptance Criteria
- is duplicated by
  - DRIVERS-2971 Read server responses after client-side timeouts (Backlog)
- is related to
  - SERVER-96344 mongos doesn't honor a failpoint's full blockTimeMS (Needs Scheduling)
  - GODRIVER-3414 Server selection should not apply to connection checkout (Backlog)
  - DRIVERS-2868 Clarify maxAwaitTimeMS adjustment by timeoutMS and RTT (Investigating)
- related to
  - GODRIVER-3172 Read responses in the background after an operation timeout (Closed)
- split to
  - CDRIVER-5526 CSOT avoid connection churn when operations timeout (Blocked)
  - CSHARP-5024 CSOT avoid connection churn when operations timeout (Blocked)
  - CXX-2998 CSOT avoid connection churn when operations timeout (Blocked)
  - GODRIVER-3173 CSOT avoid connection churn when operations timeout (Blocked)
  - JAVA-5399 CSOT avoid connection churn when operations timeout (Blocked)
  - NODE-6062 CSOT avoid connection churn when operations timeout (Blocked)
  - PHPLIB-1425 CSOT avoid connection churn when operations timeout (Blocked)
  - PYTHON-4324 CSOT avoid connection churn when operations timeout (Blocked)
  - RUBY-3432 CSOT avoid connection churn when operations timeout (Blocked)
  - RUST-1903 CSOT avoid connection churn when operations timeout (Blocked)
  - MOTOR-1291 CSOT avoid connection churn when operations timeout (Closed)