-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Unknown
Summary
The process for checking out and establishing connections described in the CMAP spec, combined with the timeout behavior described in the Server Selection spec, causes issues when users specify low operation timeouts.
Specifically, the CMAP spec describes that if there are no available connections, a connection pool should establish a new connection in-line with the check-out. The CSOT spec describes that the timeout used to create a TCP/TLS connection is min(connectTimeoutMS, min(serverSelectionTimeoutMS, remaining timeoutMS)) and the timeout used to handshake with the MongoDB server should be min(operationTimeout, remaining computedServerSelectionTimeout). As a result, if an operation times out, any in-progress connection establishment necessarily times out as well. If most operations have low timeouts (e.g. 1-5 seconds), the driver may not have enough time to establish any new connections, leading to a state where the driver cannot create any connections.
Consider the case discovered in the Go driver, which supports client-side operation timeouts via the Go context.Context type:
The current Go driver connection pool creates most new connections in-line with operations when there are no available idle connections. As a result, the Go driver times out new connection creation on either connectTimeoutMS or the operation context.Context timeout, whichever is shorter. The conflation of operation timeout with new connection creation timeout makes using operation timeouts in the Go driver complicated. The problem is exacerbated when the majority of operations use low operation timeouts (e.g. 1-5 second timeouts), when the operation frequency is high and bursty (e.g. 1,000+ op/sec), and when the time to establish a new connection is high (e.g. when TLS and auth handshake are enabled).
The result is that the current Go driver can enter a state where all connections in a connection pool have perished (usually caused by a connection pool clear due to encountering a timeout during new connection establishment) and almost no new connections can be established in-line with an operation due to the low operation timeout. Customers using the Go driver may encounter application outages as a result of the Go driver’s inability to recover from this state.
Note that the above was relevant for Go Driver v1.7.x and earlier. See GODRIVER-2038 for more context.
Any driver that implements a connection pool as described in the current CMAP spec and implements client-side operation timeout as described in the current CSOT spec can encounter the same issue discovered in the Go driver. To prevent that issue, drivers must never let operation timeout influence connection establishment timeout. Honoring both operation timeout and connection establishment timeout requires running connection check-out and connection establishment in different threads. Drivers should continue connection establishment for connectTimeoutMS, even if the check-out that requested the new connection times out.
Update the CMAP spec to describe the necessary separation of threads of execution between connection check-out and connection establishment. Update the CSOT spec to describe that connection establishment should always use a timeout of connectTimeoutMS, independent of operation timeout. Note that implementing CSOT likely requires drivers to refactor their connection pool implementations.
Note that an alternative to separate threads is to always continue establishing connections for connectTimeoutMS, even if the operation timeout has expired.
Consider the list of drivers that establish connections in the checkOut function
Motivation
Who is the affected end user?
Users using drivers that support client-side operation timeouts, especially users who set low timeouts (1-5 seconds) and run services with high operation volumes (1,000+ op/sec).
How does this affect the end user?
The driver may enter a state where it cannot create any new connections for a long period of time. The user's services may experience extended outages if that happens.
How likely is it that this problem or use case will occur?
Fairly likely for users who set low timeouts (1-5 seconds) and run services with high operation volumes (1,000+ op/sec).
If the problem does occur, what are the consequences and how severe are they?
The user's services may experience extended periods where the driver cannot establish connections and cannot do any work, either at startup or intermittently during the operation of the service.
Is this issue urgent?
Must be completed before DRIVERS-555 can be implemented in most drivers.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No.
- is related to
-
GODRIVER-2038 Use "ConnectionTimeout" for creating all new connections and background connection creation
- Closed
-
DRIVERS-2750 Clarify connection checkout timeout when timeoutMS is enabled
- Backlog
- related to
-
DRIVERS-2971 Read server responses after client-side timeouts
- Backlog
-
NODE-5780 Update Timeouts to use CSOT in OIDC Machine Callback
- Blocked
- split to
-
CDRIVER-4681 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
CSHARP-4715 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
CXX-2712 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
GODRIVER-2903 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
JAVA-5069 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
MOTOR-1151 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
NODE-5451 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
PHPLIB-1193 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
PYTHON-3834 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
RUBY-3295 Prevent conflating operation timeout with connection establishment timeout
- Blocked
-
RUST-1700 Prevent conflating operation timeout with connection establishment timeout
- Blocked