Client Backpressure Support


    • Client Backpressure Spec
    • Python Drivers
    • None
    • Hide

      Summary of necessary driver changes

      •  

      Commits for syncing spec/prose tests
      (and/or refer to an existing language POC if needed)

      •  

      Context for other referenced/linked tickets

      •  
    • In Progress
    • 16
    • 22.5
    • 24
    • 100
    • 50
    • 🟢 On Track
    • Hide

      Engineer(s): Shane Harvey, Steve Silvester, Iris Ho, Bailey Pearson

      What was accomplished since the last update?

      • DRIVERS-1934 withTransaction spec 1st and 2nd implementation drafts complete, only waiting on transaction spec owner review
      • Node draft implementation for connection rate limiter (DRIVERS-3218) is complete, spec awaiting answer to one open question
      • Began work to understand the behavior of drivers when only a single mongos is overloaded (PERF-6964)
      • Have an initial draft and Python POC for backoff and jitter changes (DRIVERS-3239)

      What's the focus over the next two weeks?

      • Complete spec changes for DRIVERS-1934 and DRIVERS-3218 and unblock corresponding team implementations
      • Finalize draft of spec changes for DRIVERS-3239
      • Complete PERF-6964 and determine whether DRIVERS-3217 can be deferred or will be required
      • Draft minor server selection spec updates (DRIVERS-2901)

      Any risks/blockers/impediments?

      • If the outcome of PERF-6964 indicates that DRIVERS-3217 is mandatory, it will add additional scope to the project (2-4 weeks per driver). In trying to implement this workload, we hit an unexpected issue because DSI’s set_server_parameter is not enabling the server parameters on replica set secondaries, which required additional work in DSI to be resolved.
    • Hide

      2025-11-21 - 🟢 On Track
      Engineer(s): Shane Harvey, Steve Silvester, Iris Ho, Bailey Pearson

      What was accomplished since the last update?

      • DRIVERS-1934 withTransaction spec 1st and 2nd implementation drafts complete, only waiting on transaction spec owner review
      • Node draft implementation for connection rate limiter (DRIVERS-3218) is complete, spec awaiting answer to one open question
      • Began work to understand the behavior of drivers when only a single mongos is overloaded (PERF-6964)
      • Have an initial draft and Python POC for backoff and jitter changes (DRIVERS-3239)

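      What's the focus over the next two weeks?

      • Complete spec changes for DRIVERS-1934 and DRIVERS-3218 and unblock corresponding team implementations
      • Finalize draft of spec changes for DRIVERS-3239
      • Complete PERF-6964 and determine whether DRIVERS-3217 can be deferred or will be required
      • Draft minor server selection spec updates (DRIVERS-2901)

      Any risks/blockers/impediments?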

      • If the outcome of PERF-6964 indicates that DRIVERS-3217 is mandatory, it will add additional scope to the project (2-4 weeks per driver). In trying to implement this workload, we hit an unexpected issue because DSI’s set_server_parameter is not enabling the server parameters on replica set secondaries, which required additional work in DSI to be resolved.

      2025-11-06 - 🟢 On Track
      Engineer(s): Shane Harvey, Steve Silvester, Iris Ho, Bailey Pearson

      What was accomplished since the last update?

      • Draft implementation for Node (2nd implementation for spec) complete on DRIVERS-1934, pending design finalization
      • Started Node draft implementation for connection rate limiter (DRIVERS-3218)
      • Completed the token bucket in sustained overload workload (PERF-7397)
        • Result: ~90% reduction in average latency during sustained overload, with similar reductions in 95th and 99th percentile latency, at the expense of a 2.8x increase in error rate
      • Reviewed Decision for Connection Rate Limiting (WRITING-34116). The agreed plan is to roll out the connection rate limiter with a conservative 20-second queue (ingressConnectionEstablishmentBurstCapacitySecs).
        • This conservative limit avoids rejecting connections that the server has a good chance of accepting within the driver's connect timeout, because older drivers will clear the connection pool and induce a metastable failure. Once customers upgrade to backpressure-enabled drivers, we can decrease this queuing time.
      • Gained insight into individual language challenges arising from the originally proposed design and revised the design proposal to mitigate these:
        • We benchmarked the new Pool backoff state separately from the "don't clear the connection pool" change and found no difference in the workload. We decided we can safely cut the pool backoff state to reduce the scope. This will simplify the implementation estimates for drivers.
        • We also decided to drop the requirement to interrupt pending connections to 1) reduce the scope and 2) avoid extra churn on connection creation attempts that might succeed.
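
The latency versus error rate trade-off reported for the token bucket workload can be illustrated with a minimal sketch of a retry budget. This is an assumption about the general shape of such a mechanism, not the design under review: the class name `RetryTokenBucket`, its parameters, and the default values are all hypothetical.

```python
class RetryTokenBucket:
    """Hypothetical retry budget: each retry spends tokens, each success refunds some.

    All names and default values here are illustrative assumptions,
    not taken from the Client Backpressure design or spec.
    """

    def __init__(self, capacity: float = 10.0, retry_cost: float = 1.0,
                 success_refund: float = 0.1) -> None:
        self.capacity = capacity
        self.retry_cost = retry_cost
        self.success_refund = success_refund
        self.tokens = capacity  # start with a full budget

    def try_acquire_retry(self) -> bool:
        """Return True and spend tokens if a retry is allowed, else False."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def record_success(self) -> None:
        """Refund part of a token after a successful operation, up to capacity."""
        self.tokens = min(self.capacity, self.tokens + self.success_refund)
```

Under sustained overload a bucket like this drains, so later failures are returned to the caller instead of being retried. That would be consistent with lower latency at the cost of a higher error rate, though the measured numbers above come from PERF-7397, not from this sketch.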

      What's the focus over the next two weeks?

      • Finalize design and estimates/delivery timelines for the project
      • Finalize spec changes (including 2nd implementation) for DRIVERS-1934

      Any risks/blockers/impediments?

      • N/A

      2025-10-24 - 🟢 On Track
      Engineer(s): Shane Harvey, Steve Silvester, Iris Ho

      What was accomplished since the last update?

      • Operation burst workload to demonstrate benefit of client backpressure (PERF-7190)
        • Results: Without any retries, the workload encounters ~8000 overload errors. With a maximum of 3 retries, it encounters ~500 errors. With a maximum of 5 retries, it encounters ~5 errors.
        • Main learning: 3 retries are not sufficient. We need to increase the limit to 5 or more.
      • Moved the Design for Client Backpressure into review (WRITING-32696).
      • Spec changes for withTransaction backpressure are in draft
      • Token bucket workload in review (PERF-7397)
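
The retry limit discussed above can be sketched as a bounded retry loop with exponential backoff and full jitter. This is a minimal illustration under stated assumptions, not the spec'd algorithm: `OverloadError`, `backoff_delay`, `run_with_retries`, and the `base` and `cap` values are hypothetical, and only the limit of 5 retries comes from the PERF-7190 finding.

```python
import random
import time


class OverloadError(Exception):
    """Hypothetical stand-in for a server overload error."""


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)) seconds.

    The base and cap values are illustrative assumptions.
    """
    return rng() * min(cap, base * (2 ** attempt))


def run_with_retries(operation, max_retries=5, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on OverloadError at most max_retries times."""
    attempt = 0
    while True:
        try:
            return operation()
        except OverloadError:
            if attempt >= max_retries:
                raise  # retry budget exhausted, surface the error
            sleep(backoff_delay(attempt, rng=rng))
            attempt += 1
```

The `sleep` and `rng` parameters exist only so the loop can be exercised without real delays, for example `run_with_retries(op, sleep=lambda s: None)`.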

      What's the focus over the next two weeks?

      • Finalize design and estimates for the project
      • Finalize all perf workloads (PERF-7190, PERF-7397, PERF-6964 - only one mongos is overloaded)
      • Finish drafting connection pool spec changes (DRIVERS-3218)

      Any risks/blockers/impediments?

      • Estimates for individual driver implementations depend on design finalization
      • Perf results for token bucket are inconclusive

      2025-10-16 - 🟢 On Track
      2025-10-16:

      • What was completed over the last two weeks?
      1. Completed availability workload to verify our improvements to connection rate limiter error handling (PERF-7078). This confirms our expected perf/availability improvements:
        Latency95thPercentile improves from 8979ms to 93.59ms
        OperationThroughput improves from 2444q/s to 6134q/s.
      2. Perf workload to verify client backpressure retry policy is in review (PERF-7190)
      • What’s the focus over the next two weeks?
      1. Complete remaining performance workloads PERF-7190 and PERF-6964.
      2. Complete review of the Design document (WRITING-32696).
      3. Draft spec changes for connection pool and withTransaction projects.
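
To put the PERF-7078 figures above in relative terms, the arithmetic can be worked as follows. The helper names are hypothetical. Only the four input numbers come from this update.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100.0


def improvement_factor(before: float, after: float) -> float:
    """How many times larger the new value is than the old one."""
    return after / before


# Latency95thPercentile: 8979 ms -> 93.59 ms, a reduction of about 99%
p95_reduction = percent_reduction(8979.0, 93.59)

# OperationThroughput: 2444 q/s -> 6134 q/s, an increase of about 2.5x
throughput_gain = improvement_factor(2444.0, 6134.0)
```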

      2025-10-03:

      • What was completed over the last two weeks?
      1. Scope document completed (WRITING-32695).
      2. Merged availability reproducer for withTransaction-induced write conflict storm (PERF-7188)
      3. Started investigation into default values for withTransaction backoff parameters and their effect on latency (PYTHON-5562). Discovered we may want to use the backoff algorithm that the server uses for write conflict retry, developed in SERVER-88000. The main difference is that the backoff grows more gradually.
      4. Availability workload to verify our improvements to connection rate limiter error handling in progress (PERF-7078)
      • What’s the focus over the next two weeks?
      1. Design document in review (WRITING-32696).
      2. Begin spec changes for connection pool and withTransaction.

      2025-09-29 - 🟢 On Track
      2025-08-19:

      • What was completed over the last two weeks?
      1. Scope document is in review (WRITING-32695).
      2. Python POC work has begun (PYTHON-5504 and PYTHON-5505).
      • What’s the focus over the next two weeks?
      1. Put Design document in review (WRITING-32696).
      2. Complete Python POC for adaptive retry loop (PYTHON-5505 and PYTHON-5506).
      3. Demo improved write conflict storm behavior (PYTHON-5504).

    • 11
    • Needed
    • Key            Status/Resolution  FixVersion
      CDRIVER-6073   Backlog
      CXX-3328       Backlog
      CSHARP-5701    Needs Triage
      GODRIVER-3637  Backlog
      JAVA-5942      Needs Triage
      NODE-7105      In Progress
      PYTHON-5495    In Progress
      PHPLIB-1703    Backlog
      RUBY-3696      Backlog
      RUST-2259      Backlog
    • None
    • None
    • None
    • None
    • None
    • None

      Summary

      The server will start to shed load during overload. We need a spec to do so safely and correctly.

      Motivation

      This is a critical piece of improving MongoDB Availability.

            Assignee:
            Bailey Pearson
            Reporter:
            Judah Schvimer
            Daria Pardue
            None
            Votes:
            0
            Watchers:
            8

              Created:
              Updated:
              29 weeks, 3 days
              None