Investigate CSOT retry behavior, returning previousError on timeout


    • Type: Task
    • Resolution: Unresolved
    • Priority: Unknown
    • Component/s: CSOT, Retryability

      Summary

      The retryable writes pseudocode includes logic to return previousError when CSOT is enabled and the operation times out:

        } else if (isExpired(timeoutMS)) {
          /* CSOT is enabled and the operation has timed out. */
          throw previousError;
        }
      

      (This was added in https://github.com/mongodb/specifications/commit/343ff9a3864e7141c5a056fbf19e71ca57e65740 but has no unified spec tests.)

      In the pseudocode, previousError is a regular retryable write error, whereas currentError is a CSOT timeout error. Throwing previousError therefore contradicts the CSOT requirement that timeout errors be distinguishable:

      If the timeoutMS option is set and the timeout expires, drivers MUST abort all blocking work and return control to the user with an error. This error MUST be distinguished in some way (e.g. custom exception type) to make it easier for users to detect when an operation fails due to a timeout.

      We should investigate how drivers can reconcile this branch of the pseudocode with the CSOT requirement.

      For example, the Go Driver can use error wrapping: https://go.dev/play/p/-YicW6L9Uw6

      PR thread: https://github.com/mongodb/specifications/pull/1878#discussion_r2744107118

      Motivation

      Who is the affected end user?

      Users who rely on CSOT with retryable operations. Downstream teams implementing the spec may also be affected if the pseudocode is implemented 1:1.

      How does this affect the end user?

      Users may be confused when a timeout occurs during a retry and they cannot determine the underlying cause from the returned error.

      How likely is it that this problem or use case will occur?

      This appears to be an edge case. It requires CSOT to be enabled, a retryable error on the first attempt, and a timeout during the retry. It is more likely with short timeoutMS values or in high-latency environments.

      If the problem does occur, what are the consequences and how severe are they?

      Users may have difficulty debugging failures.

      Is this issue urgent?

      No. This is a spec clarification and test coverage gap.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      Partially. The investigation may result in spec clarification (prose changes) and new test coverage. No functional driver changes are expected unless drivers are found to be non-conformant (Node and Go Driver, for example).

      Acceptance Criteria

      Answer:

      1. Is the pseudocode reasonable?
      2. How should drivers implement this?

      Update "drivers changes" if more is required, i.e. test / spec syncs.

            Assignee: Unassigned
            Reporter: Preston Vasquez