.selectServer leaks local timeout on retry path


    • Type: Bug
    • Resolution: Declined
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      Use Case

      As a... user running retryable operations on sharded clusters
      I want... server selection timeouts to be properly cleaned up after retries
      So that... my application does not crash

      User Experience

      When a retryable write (or read) fails with a transient error inside a transaction on a sharded cluster, the driver correctly retries and the operation succeeds. However, ~30 seconds later, an unhandled promise rejection fires with TimeoutError: Expired after 30000ms and crashes the process.

      This affects any user on a sharded cluster using explicit transactions who encounters a transient retryable error. The error itself is handled correctly (retry succeeds), but the leaked timeout causes a delayed crash.

      This was discovered during the Client Backpressure implementation (NODE-7142) but is independent of it. CB surfaced it only because of its 5 retries with longer timeouts (backoff + jitter); the problem exists on main as well.

      Dependencies

      • none

      Risks/Unknowns

      • none

      Acceptance Criteria

      Implementation Requirements

      Change:

      if (options.timeoutContext?.clearServerSelectionTimeout) timeout?.clear();

        to:

      if (!options.timeoutContext || options.timeoutContext.clearServerSelectionTimeout) {
        timeout?.clear();
      }

      The logic: clear the timeout if (a) no timeoutContext was provided (we created a local timeout), or (b) the timeoutContext explicitly says to clear it (Legacy path). Do not clear when CSOT owns the timeout (clearServerSelectionTimeout === false).
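
The three cases above can be sketched as a small truth table. This is a minimal illustration, not the driver's code: TimeoutContextLike, ClearableTimeout, oldShouldClear, newShouldClear and releaseTimeout are hypothetical names introduced here, and the real timeoutContext carries more members than the single flag shown. It shows why the old condition skips the clear when no timeoutContext is passed: optional chaining evaluates to undefined, so the branch is never taken.

```typescript
// Hypothetical, simplified stand-ins for the driver's types (assumption:
// only the one flag from the ticket matters for this decision).
interface TimeoutContextLike {
  clearServerSelectionTimeout: boolean;
}

interface ClearableTimeout {
  clear(): void;
}

// Old condition: falsy when timeoutContext is undefined, so a locally
// created timeout is never cleared.
function oldShouldClear(timeoutContext?: TimeoutContextLike): boolean {
  return Boolean(timeoutContext?.clearServerSelectionTimeout);
}

// Proposed condition: clear when there is no context (local timeout) or
// when the context explicitly asks for it (legacy path).
function newShouldClear(timeoutContext?: TimeoutContextLike): boolean {
  return !timeoutContext || timeoutContext.clearServerSelectionTimeout;
}

// Applies the proposed condition to a timeout that may or may not exist.
function releaseTimeout(
  timeout: ClearableTimeout | undefined,
  timeoutContext?: TimeoutContextLike
): void {
  if (newShouldClear(timeoutContext)) {
    timeout?.clear();
  }
}
```

Under these assumptions, the only case where the two conditions disagree is the one with no timeoutContext at all, which is the leak described in this ticket.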

      Testing Requirements

      Add a regression test that exercises the retry path on a sharded cluster with a pinned transaction session, verifying no unhandled rejections occur after the operation completes. This can be done by:

      1. Starting a transaction on a sharded cluster
      2. Using failCommand to trigger a retryable error on a write inside the transaction
      3. Verifying the operation succeeds (retry works)
      4. Waiting >30 seconds (or using a shorter serverSelectionTimeoutMS) and asserting no unhandled rejection fires
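
For step 2, the fail point document could be built with a small helper like the one below. This is a hedged sketch: buildFailCommand and the FailCommandDoc interface are hypothetical names, and the command name, error code and error label used in the usage note are placeholders, since the ticket does not say which transient error to inject. Only the failCommand field names (configureFailPoint, mode, data.failCommands, data.errorCode, data.errorLabels) are taken from the server's fail point.

```typescript
// Shape of a failCommand fail point document (subset of the fields the
// server accepts; enough for a single injected error).
interface FailCommandDoc {
  configureFailPoint: 'failCommand';
  mode: { times: number };
  data: {
    failCommands: string[];
    errorCode: number;
    errorLabels: string[];
  };
}

// Hypothetical helper: fail the named command exactly once with the given
// error code and labels, so the retry that follows can succeed.
function buildFailCommand(
  commandName: string,
  errorCode: number,
  errorLabels: string[] = []
): FailCommandDoc {
  return {
    configureFailPoint: 'failCommand',
    mode: { times: 1 },
    data: { failCommands: [commandName], errorCode, errorLabels }
  };
}
```

A test might call buildFailCommand('insert', 91, ['RetryableWriteError']) and send the result as an admin command to the mongos the session is pinned to; those three argument values are assumptions chosen for illustration. For step 4, Node's process 'unhandledRejection' event can be used to record any rejection that fires after the operation completes.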

      Documentation Requirements

      • DOCSP ticket, API docs, etc

      Follow Up Requirements

      • additional tickets to file, required releases, etc
      • if node behavior differs/will differ from other drivers, confirm with dbx devs what standard to aim for and what plan, if any, exists to reconcile the diverging behavior moving forward

            Assignee:
            Sergey Zelenov
            Reporter:
            Sergey Zelenov