Client Backpressure Support


    • Client Backpressure Spec
    • Python Drivers
    • None

      Summary of necessary driver changes

      •  

      Commits for syncing spec/prose tests
      (and/or refer to an existing language POC if needed)

      •  

      Context for other referenced/linked tickets

      •  
    • In Progress
    • 16
    • 19.5
    • 24
    • 100
    • 50
    • 🟢 On Track

      2025-11-06 - 🟢 On Track
      Engineer(s): Shane Harvey, Steve Silvester, Iris Ho, Bailey Pearson

      What was accomplished since the last update?

      • Draft implementation for Node (2nd implementation for spec) complete on DRIVERS-1934, pending design finalization
      • Started Node draft implementation for connection rate limiter (DRIVERS-3218)
      • Token bucket in sustained overload workload (PERF-7397) completed
        • Result: ~90% reduction in average latency during sustained overload, with similar reductions in the 95th and 99th percentile latencies, at the expense of a 2.8x increase in error rate
      • Reviewed Decision for Connection Rate Limiting (WRITING-34116). The agreed plan is to roll out the connection rate limiter with a conservative 20-second queue (ingressConnectionEstablishmentBurstCapacitySecs).
        • The limit is conservative so that the server does not reject connections it has a good chance of accepting within the driver's connect timeout: older drivers will clear the connection pool and induce a metastable failure. Once customers upgrade to backpressure-enabled drivers, we can decrease this queuing time.
      • Gained insight into individual language challenges arising from the originally proposed design and revised the design proposal to mitigate these:
        • We benchmarked the new Pool backoff state separately from the "don't clear the connection pool" change and found no difference in the workload. We decided we can safely cut the pool backoff state to reduce the scope. This will simplify the implementation estimates for drivers.
        • We also decided to drop the requirement to interrupt pending connections to 1) reduce the scope and 2) avoid extra churn on connection creation attempts that might succeed.
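
The latency versus error-rate trade-off reported for PERF-7397 is what a token bucket tends to produce: a request that finds the bucket empty fails right away instead of waiting behind the overload. The following is a minimal sketch of a generic token bucket in Python, not the design under review; the class name, `capacity`, `refill_rate`, and the injectable `clock` are illustrative assumptions.

```python
import time


class TokenBucket:
    """Generic token bucket (illustrative, not the spec's design).

    Holds at most `capacity` tokens and regains `refill_rate` tokens per second.
    """

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock          # injectable so the bucket can be tested
        self._tokens = capacity      # start full
        self._last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)

    def try_consume(self, cost=1.0):
        # Return True and spend `cost` tokens if available; otherwise fail fast.
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would gate each retry on `try_consume()` and surface the original error when it returns `False`, which trades extra errors for lower latency under sustained overload.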

      What's the focus over the next two weeks?

      • Finalize design and estimates/delivery timelines for the project
      • Finalize spec changes (including 2nd implementation) for DRIVERS-1934

      Any risks/blockers/impediments?

      • N/A

      2025-10-24 - 🟢 On Track
      Engineer(s): Shane Harvey, Steve Silvester, Iris Ho

      What was accomplished since the last update?

      • Operation burst workload to demonstrate benefit of client backpressure (PERF-7190)
        • Results: Without any retries the workload encounters ~8000 overload errors. With 3 max retries the workload encounters ~500 errors. With 5 max retries the workload encounters ~5 errors.
        • Main learning: 3 retries is not sufficient. We need to increase the limit to 5 or more.
      • Moved the Design for Client Backpressure into review (WRITING-32696).
      • Spec changes for withTransaction backpressure are in draft
      • Token bucket workload in review (PERF-7397)
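
The PERF-7190 result above turns on a single knob, the maximum number of retries per operation. As a rough illustration, a bounded retry loop with exponential backoff and jitter might look like the sketch below. `OverloadError`, `run_with_retries`, and the delay parameters are hypothetical names and values chosen for this example; they are not taken from the spec or from any driver.

```python
import random
import time


class OverloadError(Exception):
    """Hypothetical stand-in for a server overload error."""


def run_with_retries(operation, max_retries=5, base_delay=0.1, max_delay=10.0,
                     sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying up to `max_retries` times on OverloadError.

    Delay before retry n is a random fraction of min(max_delay, base_delay * 2**n).
    All defaults here are assumptions for illustration only.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except OverloadError:
            if attempt >= max_retries:
                raise  # retry budget exhausted: surface the overload error
            delay = rng() * min(max_delay, base_delay * (2 ** attempt))
            sleep(delay)
            attempt += 1
```

With this shape, an operation that fails four times in a row still succeeds under `max_retries=5` but raises under `max_retries=3`, which matches the direction of the workload result.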

      What's the focus over the next two weeks?

      • Finalize design and estimates for the project
      • Finalize all perf workloads (PERF-7190, PERF-7397, PERF-6964 - only one mongos is overloaded)
      • Finish drafting connection pool spec changes (DRIVERS-3218)

      Any risks/blockers/impediments?

      • Estimates for individual driver implementations depend on design finalization
      • Perf results for token bucket are inconclusive

      2025-10-16 - 🟢 On Track
      2025-10-16:

      • What was completed over the last two weeks?
      1. Completed availability workload to verify our improvements to connection rate limiter error handling (PERF-7078). This confirms our expected perf/availability improvements:
        Latency95thPercentile improves from 8979ms to 93.59ms
        OperationThroughput improves from 2444q/s to 6134q/s.
      2. Perf workload to verify the client backpressure retry policy is in review (PERF-7190)
      • What’s the focus over the next two weeks?
      1. Complete remaining performance workloads PERF-7190 and PERF-6964.
      2. Complete review of the Design document (WRITING-32696).
      3. Draft spec changes for connection pool and withTransaction projects.

      2025-10-03:

      • What was completed over the last two weeks?
      1. Scope document completed (WRITING-32695).
      2. Merged availability reproducer for withTransaction induced write conflict storm (PERF-7188)
      3. Started investigation into default values for withTransaction backoff parameters and their effect on latency (PYTHON-5562). Discovered we may want to use the backoff algorithm that the server uses for write conflict retry, developed in SERVER-88000. The main difference is that the backoff grows more gradually.
      4. Availability workload to verify our improvements to connection rate limiter error handling in progress (PERF-7078)
      • What’s the focus over the next two weeks?
      1. Design document in review (WRITING-32696).
      2. Begin spec changes for connection pool and withTransaction.
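
To make "grows more gradually" concrete, the sketch below compares two capped exponential backoff schedules that differ only in growth factor. The base delay, cap, and both growth factors are made-up values for illustration; neither the withTransaction defaults under investigation nor the SERVER-88000 algorithm are specified here.

```python
def backoff_schedule(base_ms, growth, max_ms, attempts):
    """Return the delay in milliseconds before each of `attempts` retries."""
    return [min(max_ms, base_ms * growth ** i) for i in range(attempts)]


# Hypothetical parameters: same base and cap, different growth factors.
doubling = backoff_schedule(base_ms=5, growth=2.0, max_ms=500, attempts=6)
gradual = backoff_schedule(base_ms=5, growth=1.5, max_ms=500, attempts=6)

for attempt, (fast, slow) in enumerate(zip(doubling, gradual)):
    print(f"retry {attempt}: doubling={fast:.1f}ms gradual={slow:.1f}ms")
```

A smaller growth factor keeps later retries closer together, so a transaction that needs several attempts accumulates less added latency, at the cost of backing off less aggressively during a write conflict storm.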

      2025-09-29 - 🟢 On Track
      2025-08-19:

      • What was completed over the last two weeks?
      1. Scope document is in review (WRITING-32695).
      2. Python POC work has begun (PYTHON-5504 and PYTHON-5505).
      • What’s the focus over the next two weeks?
      1. Put Design document in review (WRITING-32696).
      2. Complete Python POC for adaptive retry loop (PYTHON-5505 and PYTHON-5506).
      3. Demo improved write conflict storm behavior (PYTHON-5504).

    • 11
    • Needed
      Key            Status/Resolution  FixVersion
      CDRIVER-6073   Backlog
      CXX-3328       Needs Triage
      CSHARP-5701    Needs Triage
      GODRIVER-3637  Backlog
      JAVA-5942      Needs Triage
      NODE-7105      In Progress
      PYTHON-5495    In Progress
      PHPLIB-1703    Needs Triage
      RUBY-3696      Backlog
      RUST-2259      Backlog
    • None
    • None
    • None
    • None
    • None
    • None

      Summary

      The server will start to shed load during overload. We need a spec to do so safely and correctly.

      Motivation

      This is a critical piece of improving MongoDB Availability.

            Assignee:
            Shane Harvey
            Reporter:
            Judah Schvimer
            Daria Pardue
            None
            Votes:
            0
            Watchers:
            9

              Created:
              Updated:
              29 weeks, 3 days
              None