Type: Task
Resolution: Unresolved
Priority: Unknown
Component/s: Backpressure, Retryability
Summary
We need to extend client backpressure so that the token-bucket retry budget is tracked per node (mongod/mongos) rather than only per client, allowing retries to be throttled locally on overloaded nodes while preserving retry capacity for healthy nodes.
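As an illustration of the idea only, the following is a minimal sketch of a retry budget keyed by node address. The class names (`RetryBudget`, `PerNodeRetryBudgets`), the `capacity` and `refund` parameters, and the rule that a retry spends one token while a success refunds a fraction of one are all assumptions made for this example; none of them is taken from the driver specification or from any existing driver API.

```python
from collections import defaultdict


class RetryBudget:
    """Hypothetical token bucket: a retry spends one token, a success refunds part of one."""

    def __init__(self, capacity=10.0, refund=0.1):
        self.capacity = capacity  # assumed maximum number of tokens
        self.tokens = capacity    # the bucket starts full
        self.refund = refund      # assumed tokens returned per successful operation

    def try_retry(self):
        """Spend one token if available; return False when the budget is exhausted."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_success(self):
        """Refund a fraction of a token, never exceeding capacity."""
        self.tokens = min(self.capacity, self.tokens + self.refund)


class PerNodeRetryBudgets:
    """One independent bucket per node address (mongod or mongos)."""

    def __init__(self, capacity=10.0, refund=0.1):
        # A bucket is created lazily the first time an address is seen.
        self._buckets = defaultdict(lambda: RetryBudget(capacity, refund))

    def try_retry(self, address):
        return self._buckets[address].try_retry()

    def on_success(self, address):
        self._buckets[address].on_success()


budgets = PerNodeRetryBudgets(capacity=3.0)
overloaded, healthy = "node-a:27017", "node-b:27017"  # hypothetical addresses

# Four retry attempts against the overloaded node drain only that node's bucket.
drained = [budgets.try_retry(overloaded) for _ in range(4)]
# The healthy node keeps its own full budget.
still_ok = budgets.try_retry(healthy)
print(drained, still_ok)
```

In this sketch the fourth retry against the overloaded node is refused while the healthy node can still retry. With a single per-client bucket, the same four attempts would have exhausted the budget for both nodes.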
Motivation
Who is the affected end user?
Any client leveraging a MongoDB Driver.
How does this affect the end user?
In partial-overload scenarios, users may see unnecessarily high error rates, lower throughput, and worse latency, because overload on a single node drains a global per-client token bucket and reduces retries even against healthy nodes. They aren’t completely blocked, but they experience degraded performance and suboptimal behavior under load.
How likely is it that this problem or use case will occur?
Frequency is hard to determine until we collect data through retry telemetry. The problem is most relevant for high-load production deployments with uneven load or hotspots across nodes (e.g., some mongos or members overloaded while others remain healthy), and less relevant for small or uniformly loaded clusters.
If the problem does occur, what are the consequences and how severe are they?
Primarily a performance concern: elevated overload error rates, reduced goodput, and worse P95/P99 latency compared to what’s achievable with node-aware retry budgeting. In extreme cases this could contribute to more severe degradation during overload but is not directly an availability outage by itself.
Is this issue urgent?
Moderately urgent as a follow-on to the existing backpressure work: it refines behavior for partial-overload cases but does not block the existing per-client backpressure rollout. No hard date is specified; priority is Major or Critical, depending on the IWM roadmap.
Is this ticket required by a downstream team?
Yes, it supports downstream Workload Resilience / IWM goals and perf validation efforts, but there isn’t a single named consumer like Atlas, Shell, or Compass; it improves the core driver behavior that those surfaces rely on.
Is this ticket only for tests?
No. While perf workloads and observability are part of acceptance criteria, this ticket is for a functional spec and behavior change (introducing per-node token buckets) rather than just test-only improvements.
Cast of Characters
Engineering Lead:
Document Author:
POCers:
Product Owner:
Program Manager:
Stakeholders:
Channels & Docs
Slack Channel
[Scope Document|some.url]
[Technical Design Document|some.url]
related to: DRIVERS-3464 Implement server-side handling for retry metadata sent from drivers
Needs Triage