Type: Bug
Resolution: Declined
Priority: Major - P3
Affects Version/s: None
Component/s: None
Use Case
As a... user running retryable operations on sharded clusters
I want... server selection timeouts to be properly cleaned up after retries
So that... my application does not crash
User Experience
When a retryable write (or read) fails with a transient error inside a transaction on a sharded cluster, the driver correctly retries and the operation succeeds. However, ~30 seconds later, an unhandled promise rejection fires with TimeoutError: Expired after 30000ms and crashes the process.
This affects any user on a sharded cluster using explicit transactions who encounters a transient retryable error. The error itself is handled correctly (retry succeeds), but the leaked timeout causes a delayed crash.
This was discovered during the Client Backpressure implementation (NODE-7142) but is independent of it. Client Backpressure surfaced it only because its 5 retries run over longer timeouts (backoff + jitter); the problem exists on main as well.
Dependencies
- none
Risks/Unknowns
- none
Acceptance Criteria
Implementation Requirements
Change:
if (options.timeoutContext?.clearServerSelectionTimeout) timeout?.clear();
to:
if (!options.timeoutContext || options.timeoutContext.clearServerSelectionTimeout) {
  timeout?.clear();
}
The logic: clear the timeout if (a) no timeoutContext was provided (we created a local timeout), or (b) the timeoutContext explicitly says to clear it (Legacy path). Do not clear when CSOT owns the timeout (clearServerSelectionTimeout === false).
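To make the difference between the two conditions concrete, here is a minimal sketch that isolates them as pure functions. TimeoutContextLike, shouldClearBefore and shouldClearAfter are hypothetical names used only for illustration. They are not part of the driver, and the real timeoutContext carries more state than this.

```typescript
// Hypothetical, simplified stand-in for the driver's timeout context.
interface TimeoutContextLike {
  clearServerSelectionTimeout: boolean;
}

// Current condition: clears only when a context exists AND asks for clearing.
// With no context at all, the locally created timeout is never cleared.
function shouldClearBefore(timeoutContext?: TimeoutContextLike): boolean {
  return timeoutContext?.clearServerSelectionTimeout === true;
}

// Proposed condition: also clears when no context was provided.
function shouldClearAfter(timeoutContext?: TimeoutContextLike): boolean {
  return !timeoutContext || timeoutContext.clearServerSelectionTimeout;
}

// The three cases described above.
const cases: Array<[string, TimeoutContextLike | undefined]> = [
  ['no timeoutContext (local timeout)', undefined],
  ['Legacy path', { clearServerSelectionTimeout: true }],
  ['CSOT owns the timeout', { clearServerSelectionTimeout: false }]
];

for (const [label, ctx] of cases) {
  console.log(label, 'before:', shouldClearBefore(ctx), 'after:', shouldClearAfter(ctx));
}
```

Only the first case changes: with no timeoutContext the current condition returns false and the proposed one returns true, which is the case the leaked timeout comes from.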
Testing Requirements
Add a regression test that exercises the retry path on a sharded cluster with a pinned transaction session, verifying no unhandled rejections occur after the operation completes. This can be done by:
- Starting a transaction on a sharded cluster
- Using failCommand to trigger a retryable error on a write inside the transaction
- Verifying the operation succeeds (retry works)
- Waiting >30 seconds (or using a shorter serverSelectionTimeoutMS) and asserting no unhandled rejection fires
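As a starting point for the failCommand step above, the sketch below builds the fail point command document. buildFailCommand and FailCommandOptions are hypothetical helpers, and the command name, error code and times values in the usage are placeholder assumptions: the ticket does not say which transient error was used to reproduce the issue.

```typescript
// Hypothetical options for the helper below; not a driver API.
interface FailCommandOptions {
  commandName: string; // command to fail, e.g. 'insert'
  errorCode: number; // server error code the fail point should return
  times: number; // how many matching commands should fail
  errorLabels?: string[]; // optional labels to attach to the error
}

// Builds a configureFailPoint command document for the failCommand fail point.
function buildFailCommand(opts: FailCommandOptions) {
  return {
    configureFailPoint: 'failCommand',
    mode: { times: opts.times },
    data: {
      failCommands: [opts.commandName],
      errorCode: opts.errorCode,
      // Only include errorLabels when the caller supplied them.
      ...(opts.errorLabels ? { errorLabels: opts.errorLabels } : {})
    }
  };
}

// Placeholder values (assumptions): fail one insert with code 91.
const failPoint = buildFailCommand({ commandName: 'insert', errorCode: 91, times: 1 });
console.log(JSON.stringify(failPoint));
```

In a real test this document would presumably be sent to the admin database of the mongos the transaction is pinned to, and a lower serverSelectionTimeoutMS would keep the wait for a leaked rejection short.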
Documentation Requirements
- DOCSP ticket, API docs, etc
Follow Up Requirements
- additional tickets to file, required releases, etc
- if node behavior differs/will differ from other drivers, confirm with dbx devs what standard to aim for and what plan, if any, exists to reconcile the diverging behavior moving forward
- blocks NODE-7142 Exponential backoff and jitter in retry loops
- Needs Review