Token Bucket retry per-server


    • Type: Task
    • Resolution: Unresolved
    • Priority: Unknown
    • Component/s: Backpressure, Retryability
    • Needed

      Summary

      We need to extend client backpressure so that the token-bucket retry budget is tracked per server (individual mongod/mongos instance) rather than only per client, allowing retries to be throttled based on the health and overload state of each server independently.
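
      As a rough illustration of the idea (not the drivers' actual design), the sketch below keeps one retry token bucket per server address. Every name and value in it is hypothetical: `RetryTokenBucket`, `PerServerRetryBudget`, `may_retry`, `record_success`, the capacity, the retry cost, and the success refund are assumptions made for this example, and the sketch is not thread-safe.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RetryTokenBucket:
    """Retry budget: each retry spends tokens, each success refunds a little."""
    capacity: float = 10.0       # hypothetical maximum budget
    retry_cost: float = 1.0      # hypothetical tokens spent per retry
    success_refund: float = 0.1  # hypothetical tokens returned per success
    tokens: float = field(init=False)

    def __post_init__(self) -> None:
        # Start with a full budget.
        self.tokens = self.capacity

    def try_consume(self) -> bool:
        """Spend tokens for one retry; return False if the budget is too low."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def record_success(self) -> None:
        """Refund a fraction of a token, never exceeding the capacity."""
        self.tokens = min(self.capacity, self.tokens + self.success_refund)


class PerServerRetryBudget:
    """Keeps an independent bucket per server address (assumed key: "host:port")."""

    def __init__(self, capacity: float = 10.0) -> None:
        self._capacity = capacity
        self._buckets: Dict[str, RetryTokenBucket] = {}

    def _bucket(self, address: str) -> RetryTokenBucket:
        # Create the bucket lazily the first time a server is seen.
        if address not in self._buckets:
            self._buckets[address] = RetryTokenBucket(capacity=self._capacity)
        return self._buckets[address]

    def may_retry(self, address: str) -> bool:
        return self._bucket(address).try_consume()

    def record_success(self, address: str) -> None:
        self._bucket(address).record_success()


# Usage: exhausting the budget of one server leaves the other untouched.
budget = PerServerRetryBudget(capacity=2.0)
overloaded, healthy = "host-a:27017", "host-b:27017"
results = [budget.may_retry(overloaded) for _ in range(3)]
print(results)                    # [True, True, False]
print(budget.may_retry(healthy))  # True
```

      With a single client-wide bucket, the third retry against the overloaded server and the first retry against the healthy server would draw from the same exhausted budget, so both would be refused.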

      Motivation

      Who is the affected end user?

      Any client leveraging a MongoDB Driver.

      How does this affect the end user?

      Without per-server token buckets, overload on one server can exhaust a shared retry budget and reduce retries against other servers, leading to unnecessarily high error rates, lower throughput, and worse latency. Users aren’t completely blocked, but see degraded and harder-to-predict performance, especially when some servers are healthy and others are not.

      How likely is it that this problem or use case will occur?

      Frequency is hard to determine until we collect data through retry telemetry. The problem is most relevant for multi-node deployments (replica sets, sharded clusters) under uneven or bursty load, where some servers are frequently overloaded while others remain able to serve traffic.

      If the problem does occur, what are the consequences and how severe are they?

      Primarily a performance and efficiency concern:

      • Elevated rates of server overload errors that could otherwise be avoided.
      • Reduced goodput and worse P95/P99 latency compared to what per-server retry budgeting could achieve.
      • Under more severe overload, this can exacerbate degradation, but it is not by itself a hard outage.

      Is this issue urgent?

      Moderately urgent as a refinement to existing backpressure behavior: it improves how drivers behave under heterogeneous server load but does not block the current backpressure/IWM rollout. No specific date is mandated; priority is Major or Critical, depending on the IWM roadmap and performance findings.

      Is this ticket required by a downstream team?

      Yes. It supports downstream Workload Resilience / IWM and performance goals by mapping server overload to client retries more accurately. This benefits Atlas and any product that depends on stable overload behavior, even though it is not tied to a single named consumer.

      Is this ticket only for tests?

      No. While it will require new perf/validation workloads, this is a functional behavioral change (introducing per-server token-bucket semantics in drivers), not just additional testing.

            Assignee:
            Unassigned
            Reporter:
            Jib Adegunloye