- Type: Improvement
- Resolution: Unresolved
- Priority: Unknown
- Component/s: CSOT
Summary
The CSOT spec requires that drivers add maxTimeMS to commands when a timeout is specified. According to the CSOT spec, the goal of adding maxTimeMS is to minimize connection churn:
When constructing a command, drivers use the timeoutMS option to derive a value for the maxTimeMS command option and the socket timeout. The full time to round trip a command is (network RTT + server-side execution time). If both maxTimeMS and socket timeout were set to the same value, the server would never be able to respond with a MaxTimeMSExpired error because drivers would hit the socket timeout first and close the connection. This would lead to connection churn if the specified timeout is too low. To allow the server to gracefully error and avoid churn, drivers must account for the network round trip in the maxTimeMS calculation.
However, we received a customer report that enabling the automatic maxTimeMS feature in the Go Driver was not preventing connection churn (see HELP-56519). We found that using network round-trip time to predict maxTimeMS was ineffective at getting the server to respond before the client-side timeout was up. Furthermore, we found that for very low client-side timeouts (less than 1 second), there was no maxTimeMS value that would allow the server to respond before the client-side timeout was up. See experimentation in GODRIVER-3152, and here and here.
Instead, we found that setting maxTimeMS does consistently get the server to respond soon after the client-side timeout (99% of responses were available within 1s after the client-side timeout). To minimize connection churn, drivers should attempt to read the server response on a connection after a client-side timeout instead of immediately closing the connection.
There are multiple ways to implement that behavior, including:
- When an operation reaches a client-side timeout, mark the connection as awaiting a server response. When the connection is checked in, start a background routine that awaits the "stale" server response before making the connection available to check out again.
- When an operation reaches a client-side timeout, mark the connection as awaiting a server response and check in the connection. When another operation checks out that connection, try to read the "stale" server response before running the new operation.
Drivers should only attempt to read the "stale" server response for up to a maximum amount of time (timeout TBD). If the "stale" server response is not available within that timeout, drivers should close the connection.
Motivation
Who is the affected end user?
Customers who use client-side timeouts, especially customers with high-volume applications that set low client-side timeouts.
How does this affect the end user?
During events that increase the rate of client-side operation timeouts, customers may experience increased connection churn (i.e. increased rates of connections being closed and opened). The increased connection churn may also increase database load and make operations slower. The slower operations may cause more timeouts, leading to more connection churn.
How likely is it that this problem or use case will occur?
Any customer who sets low operation timeouts in high-volume applications may experience enough connection churn to notice performance degradation in their application.
If the problem does occur, what are the consequences and how severe are they?
The customer's application may experience anything from intermittent performance degradation to total loss of database availability.
Is this issue urgent?
No.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No.
Acceptance Criteria
What specific requirements must be met to consider the design phase complete?
Issue Links
- duplicates
  - DRIVERS-2884 CSOT avoid connection churn when operations timeout (Investigating)
- is related to
  - GODRIVER-3302 Panic when reading data from connection (Closed)
  - GODRIVER-3172 Read responses in the background after an operation timeout (Closed)
  - DRIVERS-2347 Prevent conflating operation timeout with connection establishment timeout (In Progress)
  - GODRIVER-3152 Set maxTimeMS to minimize connection churn (Closed)