Type: Improvement
Resolution: Unresolved
Priority: Unknown
Fix Version/s: None
Component/s: CMAP, CSOT, Retryability
Labels: None
Epic Link: CSOT GA
Driver Changes: Needed
Downstream Changes Summary:
Summary of necessary driver changes

Commits for syncing spec/prose tests
(and/or refer to an existing language POC if needed)

Context for other referenced/linked tickets

Summary

To maintain availability and reduce churn, drivers are required to drain leftover server responses from connections that time out mid-read (with CSOT + maxTimeMS). If the connection has been idle for more than 3 seconds, the CMAP specification requires an “aliveness” check while awaiting these pending responses:

If at least one byte can be read… a retryable error should be propagated to the operation layer.

This ticket defines an alternative behavior where if at least one byte is read, the driver could refresh the pending timer and continue draining the pending response within the same checkout. That is, we should push the retry logic to the connection layer.

Motivation

This solution ensures that the caller doesn’t waste operation retries on connection defects, and that the pool avoids leaking dead or half-dead sockets back to the operation layer. This creates a clearer contract between caller and pool: if the pool returns an error, it has done everything possible to satisfy the request. The following sequence represents the pathological case:

Op starts
Pool chooses a connection in a "pending response" state outside timeout window
- Does aliveness check, passes
- Sends retryable error to op
Op retries and gets new pool

The connection keeps doing this over and over again, trickling, say, 1 byte per second: it never gets successfully checked out or closed, but it holds up any operation that tries to use it.

This decentralized retryability behavior is further explored in DRIVERS-1262.

Acceptance Criteria

Push the operation-layer retry logic for awaiting a pending response down to the connection layer. More specifically, the “Aliveness check” step should be updated to include the following:

If at least one byte is read, the driver MUST refresh the pending timer and continue draining the pending response within the same checkout; it MUST NOT surface an intermediate error to the operation layer.
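The requirement above can be sketched as a drain loop. This is a minimal illustration, not spec pseudocode: `FakeConn`, `drain_pending_response`, and the `pending_window` parameter are hypothetical names, and the 3-second default is only borrowed from the figure in the Summary.

```python
import time


class FakeConn:
    """Hypothetical stand-in for a pooled connection with a leftover response."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self.closed = False

    def read(self, max_bytes, timeout):
        # Returns b"" to model a read that saw no data before `timeout` elapsed.
        if not self._chunks:
            return b""
        chunk = self._chunks.pop(0)
        if len(chunk) > max_bytes:
            self._chunks.insert(0, chunk[max_bytes:])
        return chunk[:max_bytes]

    def close(self):
        self.closed = True


def drain_pending_response(conn, remaining, pending_window=3.0, clock=time.monotonic):
    """Drain `remaining` leftover bytes within a single checkout.

    Returns True if the connection is clean and reusable, False if it was closed.
    No intermediate error is surfaced to the caller while bytes keep arriving.
    """
    deadline = clock() + pending_window
    while remaining > 0:
        budget = deadline - clock()
        if budget <= 0:
            conn.close()
            return False
        data = conn.read(remaining, timeout=budget)
        if not data:
            # No byte arrived inside the window: treat the connection as dead.
            conn.close()
            return False
        remaining -= len(data)
        # At least one byte was read: refresh the pending timer and keep draining.
        deadline = clock() + pending_window
    return True
```

In this sketch the only outcomes visible to the caller are a drained connection or a closed one, which mirrors the MUST NOT clause above.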

Additionally, the following changes are required:

The sections on awaiting pending responses in the Retryable Reads and Retryable Writes specifications will need to be removed.
The “Checking Out a Connection” pseudocode currently short-circuits upward on a pending-response error. This needs to be changed to loop/close/try another connection before returning.
All unified spec tests and prose tests that assume checkOut failed events as part of retryability will need to be updated to exclude those checks.
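The checkout change (loop, close, try another connection) might look roughly like the following. It is a simplified sketch, not the CMAP “Checking Out a Connection” pseudocode: `Conn`, its `healthy` flag, and `check_out` are hypothetical, and pool limits, wait queues, and events are ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Conn:
    name: str
    pending: bool = False   # a leftover response still has to be drained
    healthy: bool = True    # hypothetical flag: whether draining will succeed
    closed: bool = False

    def drain(self):
        # Stand-in for draining; a real driver would read from the socket.
        if self.healthy:
            self.pending = False
        return self.healthy

    def close(self):
        self.closed = True


def check_out(idle, create):
    """Return a usable connection instead of short-circuiting on a pending-response failure."""
    while idle:
        conn = idle.popleft()
        if not conn.pending or conn.drain():
            return conn
        # Draining failed: close this connection and try the next one.
        conn.close()
    # No reusable idle connection is left; fall back to a fresh one.
    return create()
```

With this shape, a connection that fails to drain is closed inside the pool and never reaches the operation layer as a retryable error.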

Pitfalls

Despite future efforts (e.g. DRIVERS-1262), retryable reads and writes are currently operation-layer mechanisms.
With this alternative treatment of the aliveness check, a connection that trickles bytes may block a non-CSOT operation indefinitely.
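The trickling pitfall can be made concrete with a small simulation. It is a toy model under stated assumptions: `seconds_blocked` and its parameters are hypothetical, the pending window is refreshed on every byte, and an operation timeout (as with CSOT) is assumed to cap the total wait.

```python
def seconds_blocked(remaining_bytes, seconds_per_byte, pending_window=3.0, op_timeout=None):
    """Simulated time one checkout spends draining a trickling connection."""
    cap = float("inf") if op_timeout is None else op_timeout
    elapsed = 0.0
    while remaining_bytes > 0 and elapsed < cap:
        if seconds_per_byte > pending_window:
            # No byte arrives inside the window, so the pool gives up.
            elapsed += pending_window
            break
        # One byte arrives in time; the window is refreshed and draining continues.
        elapsed += seconds_per_byte
        remaining_bytes -= 1
    return min(elapsed, cap)


# Without an operation timeout the wait grows with the size of the leftover response.
print(seconds_blocked(1000, 1.0))
# With an operation timeout the same trickle is cut off at the deadline.
print(seconds_blocked(1000, 1.0, op_timeout=10.0))
```

In this model the time blocked without a timeout scales linearly with the number of leftover bytes, so a large pending response delivered at 1 byte per second holds the checkout for as long as the bytes keep coming.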

related to

DRIVERS-2884 CSOT avoid connection churn when operations timeout

Ready for Work

Assignee: Oleksandr Poliakov
Reporter: Preston Vasquez
Votes: 0
Watchers: 5

Created: Aug 27 2025 10:28:29 PM UTC
Updated: Dec 01 2025 05:55:35 PM UTC
