Avoid clearing the connection pool when the server connection rate limiter triggers


    • Needed

      Summary of necessary driver changes

      •  

      Commits for syncing spec/prose tests
      (and/or refer to an existing language POC if needed)

      •  

      Context for other referenced/linked tickets

      •  

      Summary

      Avoid clearing the connection pool when the server connection rate limiter triggers.

      Motivation

      When a driver creates a new connection to an overloaded server and the server rejects it because of its ingress connection rate limiter, the driver reacts by clearing the connection pool, closing the SDAM monitoring connection, and triggering an immediate SDAM check. This is harmful for two reasons: 1) the immediate SDAM check must create a new connection, which will likely fail for the same reason; and 2) the existing connections in the pool were healthy, so there was no reason to clear them. After the clear, the client must repopulate the pool, which puts even more connection-creation pressure on the already overloaded node.
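      The difference between the current and the proposed handling can be sketched in Python. This is a minimal, hypothetical model, not any driver's actual code: `Pool`, `ServerState`, `ConnectionEstablishmentError`, its `rate_limited` flag, and `handle_establishment_error` are illustrative names. The sketch also assumes the driver can recognize a rate-limiter rejection, which this ticket does not specify how to do.

```python
from dataclasses import dataclass
from enum import Enum


class ServerType(Enum):
    UNKNOWN = "Unknown"
    RS_PRIMARY = "RSPrimary"


@dataclass
class Pool:
    """Hypothetical, simplified pool that only tracks idle connections."""
    idle_connections: int = 0
    generation: int = 0

    def clear(self) -> None:
        # Clearing starts a new generation and discards every existing connection.
        self.generation += 1
        self.idle_connections = 0


@dataclass
class ServerState:
    """Hypothetical stand-in for a server description and its SDAM monitor."""
    server_type: ServerType = ServerType.RS_PRIMARY
    monitor_connection_open: bool = True
    immediate_check_requested: bool = False


class ConnectionEstablishmentError(Exception):
    """Hypothetical error raised while opening or handshaking a new connection."""

    def __init__(self, message: str, rate_limited: bool = False) -> None:
        super().__init__(message)
        # Assumption: the driver can tell that the server's ingress connection
        # rate limiter rejected this connection; how it does so is unspecified.
        self.rate_limited = rate_limited


def handle_establishment_error(
    pool: Pool,
    server: ServerState,
    err: ConnectionEstablishmentError,
    avoid_clear_on_rate_limit: bool,
) -> None:
    if avoid_clear_on_rate_limit and err.rate_limited:
        # Proposed: leave the pool, the monitor, and the server description
        # untouched; only this connection attempt fails (the caller still sees err).
        return
    # Behavior described in the ticket: treat the server as unhealthy.
    server.server_type = ServerType.UNKNOWN
    pool.clear()
    server.monitor_connection_open = False
    server.immediate_check_requested = True


if __name__ == "__main__":
    err = ConnectionEstablishmentError("simulated rate-limiter rejection", rate_limited=True)
    for proposed in (False, True):
        pool, server = Pool(idle_connections=50), ServerState()
        handle_establishment_error(pool, server, err, avoid_clear_on_rate_limit=proposed)
        print(proposed, pool.idle_connections, server.server_type.name)
```

      In this sketch, a rate-limiter rejection under the proposed policy leaves all 50 pooled connections in place, while any other establishment error still follows the existing path.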

      In practice, this behavior acts as a poorly designed circuit breaker: it shuts off traffic to the overloaded server, the server recovers after some time, traffic resumes and hits the rate limit again, and traffic is shut off again. This cycle can lead to a metastable failure mode.
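      A toy, deterministic model can illustrate the feedback loop. Everything here is an assumption for illustration only: time advances in discrete ticks, the server accepts at most `rate_limit` new connections per tick and rejects the rest, and the client always tries to refill its pool to `pool_target`. Real drivers and servers (connection timeouts, retries, monitoring) behave far less simply, and `simulate` is a hypothetical helper.

```python
def simulate(
    ticks: int,
    pool_target: int,
    initial_pool: int,
    rate_limit: int,
    clear_on_reject: bool,
) -> tuple[int, int]:
    """Toy model: returns (total connection attempts, final pool size)."""
    pool = initial_pool
    attempts = 0
    for _ in range(ticks):
        wanted = pool_target - pool          # connections the client tries to open
        accepted = min(wanted, rate_limit)   # server admits at most rate_limit per tick
        rejected = wanted - accepted         # the rest hit the ingress rate limiter
        attempts += wanted
        pool += accepted
        if rejected > 0 and clear_on_reject:
            # Current behavior: any rejection clears the whole pool,
            # including the connections that were just established.
            pool = 0
    return attempts, pool


if __name__ == "__main__":
    print(simulate(5, 10, 8, 1, clear_on_reject=True))
    print(simulate(5, 10, 8, 1, clear_on_reject=False))
```

      With a 10-connection pool that starts with 8 connections and a limit of one new connection per tick, five ticks give `(42, 0)` when every rejection clears the pool and `(3, 10)` when it does not. In this model, clearing turns a single rejected attempt into a pool that never refills, while the client keeps hammering the server with connection attempts.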

      Who is the affected end user?

      Who are the stakeholders?

      How does this affect the end user?

      Are they blocked? Are they annoyed? Are they confused?

      How likely is it that this problem or use case will occur?

      Main path? Edge case?

      If the problem does occur, what are the consequences and how severe are they?

      Minor annoyance at a log message? Performance concern? Outage/unavailability? Failover can't complete?

      Is this issue urgent?

      Does this ticket have a required timeline? What is it?

      Is this ticket required by a downstream team?

      Needed by e.g. Atlas, Shell, Compass?

      Is this ticket only for tests?

      Does this ticket have any functional impact, or is it just test improvements?

      Acceptance Criteria

      What specific requirements must be met to consider the design phase complete?

              Assignee:
              Unassigned
              Reporter:
              Shane Harvey
              Jib Adegunloye
              Votes:
              0
              Watchers:
              6

                Created:
                Updated: