[Investigation] Retry pending response drain in connection check out


    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Unknown
    • None
    • Component/s: CMAP, CSOT, Retryability
    • None
    • Needed

      Summary of necessary driver changes

      •  

      Commits for syncing spec/prose tests
      (and/or refer to an existing language POC if needed)

      •  

      Context for other referenced/linked tickets

      •  

      Summary

      To maintain availability and reduce churn, drivers are required to drain leftover server responses from connections that time out mid-read (with CSOT + maxTimeMS). If the connection has been idle for more than 3 seconds, the CMAP specification requires an “aliveness” check while awaiting these pending responses:

      If at least one byte can be read… a retryable error should be propagated to the operation layer.

      This ticket defines an alternative behavior where if at least one byte is read, the driver could refresh the pending timer and continue draining the pending response within the same checkout. That is, we should push the retry logic to the connection layer.

      Motivation

      This solution ensures that the caller doesn’t waste operation retries on connection defects, and the pool avoids leaking dead/half-dead sockets back to the operation layer. This creates a clearer contract between caller and pool: if the pool returns an error, it has done everything possible to satisfy the request. The following sequence represents the pathological case:

      • Op starts
      • Pool chooses a connection in a "pending response" state outside timeout window
        • Does aliveness check, passes
        • Sends retryable error to op
      • Op retries and gets new pool

      A connection can repeat this cycle indefinitely while trickling, say, 1 byte per second: it is never successfully checked out or closed, yet it holds up every operation that tries to use it.

      This decentralized retryability behavior is further explored in DRIVERS-1262.

      Acceptance Criteria

      Push the operation-layer retry logic for awaiting a pending response down to the connection layer. More specifically, the “Aliveness check” step should be updated to include the following:

      If at least one byte is read, the driver MUST refresh the pending timer and continue draining the pending response within the same checkout; it MUST NOT surface an intermediate error to the operation layer.
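The proposed rule can be sketched roughly as follows. This is a minimal illustration, not the spec's pseudocode: `ScriptedConn`, `drain_pending`, and `PENDING_WINDOW_S` are hypothetical names, and the fake connection simply replays scripted reads, with `None` standing in for a read that times out.

```python
from typing import List, Optional

# Assumed value, mirroring the 3-second window mentioned in the Summary.
PENDING_WINDOW_S = 3.0


class ScriptedConn:
    """Hypothetical stand-in for a socket: each read pops one scripted event."""

    def __init__(self, events: List[Optional[bytes]]) -> None:
        self.events = list(events)
        self.closed = False

    def read(self, max_bytes: int, timeout: float) -> bytes:
        # None (or an exhausted script) models "no bytes arrived within timeout".
        event = self.events.pop(0) if self.events else None
        if event is None:
            raise TimeoutError("no bytes within the pending window")
        return event[:max_bytes]

    def close(self) -> None:
        self.closed = True


def drain_pending(conn: ScriptedConn, remaining: int,
                  window: float = PENDING_WINDOW_S) -> bool:
    """Drain `remaining` bytes of a pending response within one checkout.

    Every read that yields at least one byte is followed by a read with a
    fresh `window`, which is how this sketch "refreshes the pending timer".
    No intermediate error is raised to the caller; the result is only
    "drained" (True) or "closed" (False).
    """
    while remaining > 0:
        try:
            chunk = conn.read(remaining, timeout=window)
        except TimeoutError:
            conn.close()  # nothing arrived in the window: treat as dead
            return False
        if not chunk:
            conn.close()  # zero bytes models EOF from the peer
            return False
        remaining -= len(chunk)
    return True
```

In this sketch, `drain_pending(ScriptedConn([b"ab", b"c", b"de"]), 5)` drains across three reads and returns `True`, while a script that stalls after one byte closes the connection and returns `False`.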

      Additionally, the following changes are required:

      1. Remove the sections on awaiting pending responses from the Retryable Reads and Retryable Writes specifications.
      2. The “Checking Out a Connection” pseudocode currently short-circuits upward on a pending-response error. This needs to be changed to loop/close/try another connection before returning.
      3. All unified spec tests and prose tests that assume checkOut failed events as part of retryability will need to be updated to exclude those checks.
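One way to picture the checkout change in item 2 is the loop below. It is a sketch under simplifying assumptions: `Conn`, `check_out`, and `new_conn` are hypothetical names, the drain result is a canned flag, and pool limits, wait queues, and events are omitted.

```python
from collections import deque
from typing import Callable, Deque


class Conn:
    """Hypothetical pooled connection; `drains` scripts the drain outcome."""

    def __init__(self, name: str, pending: bool = False, drains: bool = True) -> None:
        self.name = name
        self.pending = pending  # True if a response is still unread
        self.drains = drains
        self.closed = False

    def drain(self) -> bool:
        # Placeholder for the real pending-response drain.
        return self.drains

    def close(self) -> None:
        self.closed = True


def check_out(idle: Deque[Conn], new_conn: Callable[[], Conn]) -> Conn:
    """Return a usable connection instead of surfacing a pending-response error."""
    while idle:
        conn = idle.popleft()
        if not conn.pending:
            return conn
        if conn.drain():
            conn.pending = False
            return conn
        # Drain failed: close this connection and try the next one
        # rather than short-circuiting upward to the operation layer.
        conn.close()
    # No idle connection was usable; fall back to establishing a new one.
    return new_conn()
```

With an idle queue of one connection that fails to drain followed by one that succeeds, the first is closed and the second is returned; the caller never sees an error for the first.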

      Pitfalls

      • Despite future efforts (e.g. DRIVERS-1262), retryable reads and writes are currently operation-layer mechanisms.
      • With this alternative treatment of the aliveness check, a connection that trickles bytes may block a non-CSOT operation indefinitely.
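The second pitfall can be made concrete with some back-of-the-envelope arithmetic. The helper below is purely illustrative: `drain_outcome` is a hypothetical name, the 3-second window is taken from the Summary, and the model assumes bytes arrive one at a time at a fixed interval with no other deadline in play.

```python
from typing import Tuple


def drain_outcome(remaining_bytes: int, byte_interval_s: float,
                  window_s: float = 3.0) -> Tuple[bool, float]:
    """Return (drained, seconds_blocked) for a connection trickling single bytes.

    Assumption: each byte that arrives refreshes the pending timer, so the
    timer only fires when the gap between bytes reaches `window_s`.
    """
    if remaining_bytes <= 0:
        return True, 0.0
    if byte_interval_s >= window_s:
        # The first gap already exhausts the window; the drain gives up.
        return False, window_s
    # Every gap is shorter than the window, so the timer never fires and
    # the checkout blocks until the whole response has trickled in.
    return True, remaining_bytes * byte_interval_s


print(drain_outcome(1_000_000, 1.0))  # (True, 1000000.0)
print(drain_outcome(1_000_000, 5.0))  # (False, 3.0)
```

Under these assumptions, a 1 MB pending response arriving at 1 byte per second would hold a checkout for about 11.6 days, which is why an operation without a CSOT deadline is exposed.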

            Assignee:
            Unassigned
            Reporter:
            Preston Vasquez
            Votes:
            0
            Watchers:
            4

              Created:
              Updated: