[DRIVERS-2347] Prevent conflating operation timeout with connection establishment timeout Created: 03/Jun/22  Updated: 06/Nov/23

Status: In Progress
Project: Drivers
Component/s: CMAP, CSOT
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Matt Dale Assignee: Shane Harvey
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Issue split
split to CDRIVER-4681 Prevent conflating operation timeout ... Blocked
split to CSHARP-4715 Prevent conflating operation timeout ... Blocked
split to CXX-2712 Prevent conflating operation timeout ... Blocked
split to GODRIVER-2903 Prevent conflating operation timeout ... Blocked
split to JAVA-5069 Prevent conflating operation timeout ... Blocked
split to MOTOR-1151 Prevent conflating operation timeout ... Blocked
split to NODE-5451 Prevent conflating operation timeout ... Blocked
split to PHPLIB-1193 Prevent conflating operation timeout ... Blocked
split to PYTHON-3834 Prevent conflating operation timeout ... Blocked
split to RUBY-3295 Prevent conflating operation timeout ... Blocked
split to RUST-1700 Prevent conflating operation timeout ... Blocked
Related
is related to GODRIVER-2038 Use "ConnectionTimeout" for creating ... Closed
Epic Link: DRIVERS-555
Driver Changes: Needed
Quarter: FY24Q3, FY25Q1
Engineering Lead: Steve Silvester
Program Manager: Esha Bhargava
Start date:
Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4681 Blocked
CXX-2712 Blocked
CSHARP-4715 Blocked
GODRIVER-2903 Blocked
JAVA-5069 Blocked
NODE-5451 Blocked
MOTOR-1151 Blocked
PYTHON-3834 Blocked
PHPLIB-1193 Blocked
RUBY-3295 Blocked
RUST-1700 Blocked

 Description   

Summary

The process for checking out and establishing connections described in the CMAP spec, combined with the timeout behavior described in the Server Selection spec, causes issues when users specify low operation timeouts.

Specifically, the CMAP spec says that if there are no available connections, a connection pool should establish a new connection in-line with the check-out. The CSOT spec says that the timeout used to create a TCP/TLS connection is min(connectTimeoutMS, min(serverSelectionTimeoutMS, remaining timeoutMS)) and that the timeout used to handshake with the MongoDB server should be min(operationTimeout, remaining computedServerSelectionTimeout). As a result, if an operation times out, any connection establishment in progress for that operation necessarily times out as well. If most operations have low timeouts (e.g. 1-5 seconds), the driver may never have enough time to finish establishing a connection, leaving it unable to create any new connections.

Consider the case discovered in the Go driver, which supports client-side operation timeouts via the Go context.Context type:

The current Go driver connection pool creates most new connections in-line with operations when there are no available idle connections. As a result, the Go driver times out new connection creation on either connectTimeoutMS or the operation context.Context timeout, whichever is shorter. The conflation of operation timeout with new connection creation timeout makes using operation timeouts in the Go driver complicated. The problem is exacerbated when the majority of operations use low operation timeouts (e.g. 1-5 seconds), when the operation frequency is high and bursty (e.g. 1,000+ op/sec), and when the time to establish a new connection is high (e.g. when TLS and the auth handshake are enabled).

The result is that the current Go driver can enter a state where all connections in a connection pool have perished (usually because the pool was cleared after a timeout during new connection establishment) and almost no new connections can be established in-line with an operation because of the low operation timeout. Customers using the Go driver may encounter application outages as a result of the Go driver’s inability to recover from this state.

Note that the above was relevant for Go Driver v1.7.x and earlier. See GODRIVER-2038 for more context.

Any driver that implements a connection pool as described in the current CMAP spec and implements client-side operation timeout as described in the current CSOT spec may encounter the same issue discovered in the Go driver. To prevent that issue, drivers must never let operation timeout influence connection establishment timeout. Honoring both operation timeout and connection establishment timeout requires running connection check-out and connection establishment in different threads. Drivers should continue connection establishment for connectTimeoutMS, even if the check-out that requested the new connection times out.
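The separation described above can be sketched in Go roughly as follows. This is a simplified illustration, not the Go driver's actual pool: the pool and conn types, the idle channel, the dial function and runDemo are all hypothetical, and a real pool would also handle errors, pool clearing and size limits. Check-out waits only as long as the operation's context allows, while establishment runs in its own goroutine bounded only by the connect timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// conn is a hypothetical stand-in for an established connection.
type conn struct{ id int }

// pool is a hypothetical, heavily simplified connection pool.
type pool struct {
	connectTimeout time.Duration
	idle           chan *conn // buffered; holds established connections
	dial           func(ctx context.Context) (*conn, error)
}

// checkOut waits for a connection until ctx (the operation timeout) expires.
// Establishment is started in a separate goroutine and is not tied to ctx.
func (p *pool) checkOut(ctx context.Context) (*conn, error) {
	select {
	case c := <-p.idle:
		return c, nil
	default:
	}
	go p.establish()
	select {
	case c := <-p.idle:
		return c, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// establish dials with a timeout derived only from connectTimeout, so it
// keeps going even if the check-out that requested it has already timed out.
func (p *pool) establish() {
	ctx, cancel := context.WithTimeout(context.Background(), p.connectTimeout)
	defer cancel()
	c, err := p.dial(ctx)
	if err != nil {
		return // a real pool would report the error here
	}
	select {
	case p.idle <- c:
	default: // pool is full; a real pool would close the connection
	}
}

// runDemo simulates a 200ms dial with 50ms operation timeouts.
func runDemo() (firstTimedOut, secondSucceeded bool) {
	p := &pool{
		connectTimeout: time.Second,
		idle:           make(chan *conn, 4),
		dial: func(ctx context.Context) (*conn, error) {
			select {
			case <-time.After(200 * time.Millisecond):
				return &conn{id: 1}, nil
			case <-ctx.Done():
				return nil, ctx.Err()
			}
		},
	}

	ctx1, cancel1 := context.WithTimeout(context.Background(), 50*time.Millisecond)
	_, err := p.checkOut(ctx1)
	cancel1()
	firstTimedOut = errors.Is(err, context.DeadlineExceeded)

	// Give the background establishment time to finish.
	time.Sleep(300 * time.Millisecond)

	ctx2, cancel2 := context.WithTimeout(context.Background(), 50*time.Millisecond)
	c, err := p.checkOut(ctx2)
	cancel2()
	secondSucceeded = err == nil && c != nil
	return firstTimedOut, secondSucceeded
}

func main() {
	first, second := runDemo()
	fmt.Println("first check-out timed out:", first)
	fmt.Println("second check-out succeeded:", second)
}
```

In this sketch the first check-out fails on its own deadline, but the connection it requested still completes and serves the next check-out, so low operation timeouts no longer starve the pool.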

Update the CMAP spec to describe the necessary separation of threads of execution between connection check-out and connection establishment. Update the CSOT spec to state that connection establishment should always use a timeout of connectTimeoutMS, independent of operation timeout. Note that implementing CSOT likely requires drivers to refactor their connection pool implementations.

Note that an alternative to separate threads is to always continue establishing connections for connectTimeoutMS, even if the operation timeout has expired.

Consider the list of drivers that establish connections in the checkOut function

Motivation

Who is the affected end user?

Users using drivers that support client-side operation timeouts, especially users who set low timeouts (1-5 seconds) and run services with high operation volumes (1,000+ op/sec).

How does this affect the end user?

The driver may enter a state where it cannot create any new connections for a long period of time. The user's services may experience extended outages if that happens.

How likely is it that this problem or use case will occur?

Fairly likely for users who set low timeouts (1-5 seconds) and run services with high operation volumes (1,000+ op/sec).

If the problem does occur, what are the consequences and how severe are they?

The user's services may experience extended periods where the driver cannot establish connections and cannot do any work, either at startup or intermittently during the operation of the service.

Is this issue urgent?

Must be completed before DRIVERS-555 can be implemented in most drivers.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No.



 Comments   
Comment by Shane Harvey [ 03/Jun/22 ]

Is this a duplicate of DRIVERS-1801?

Generated at Thu Feb 08 08:25:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.