Go Driver / GODRIVER-2145

Connection pool improvements

    • Type: Epic
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 1.8.0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • 5
    • 12
    • 12
    • 150

      Engineer: Matt

      Summary: Address unnecessary connection pool clears and connection churn on operations with a small context timeout

      2021-11-02: Updated target date to 2021-11-05

      Status update:

      • Functional changes merged.
      • Tesla has been informed of the changes, and encouraged to run a load test.
      • Remaining work is technical debt.

      Rationale for delays:

      • No rationale.

      Risks:

      • None.

      2021-10-19: Updated target date to 2021-10-29

      Status update:

      • Connection pool rewrite is waiting on final review from Kevin and on a fix for a stress test that uses up all ports on some hosts.
      • Put up a review that reduces connection churn by returning an error early when the operation timeout is less than the minimum tracked RTT.
      • Next: planning to ask Tesla to test a release candidate.

      Rationale for delays:

      • Surprising test failure with stress tests using up all ports.
      • Longer review time from reviewers with competing priorities.

      Risks:

      • None.

      2021-10-05: Updated target date to 2021-10-22.

      Status update:

      • The change that has the connection pool create connections in the background is still in review. Meeting to discuss it in person next week.
      • Stress tests are now in review.
      • Investigated and scoped a solution for avoiding connection churn when little time remains on an operation context.

      Rationale for delays:

      • Four days of PTO.
      • Review of connection pool redesign taking longer than anticipated.

      Risks:

      • None.

      2021-09-21: No update to target date.

      Status update:

      • The change that has the connection pool create connections in the background is in review.

      Rationale for delays:

      • No delays.

      Risks:

      • None. Still on track for the 2021-10-15 target date.

      2021-09-07: Setting target date to 2021-10-15

      Rationale: Estimated remaining work totals 6 weeks:

      • GODRIVER-2038: 2 weeks
      • GODRIVER-2065: 3 weeks. This is a rough estimate. There is no prototype yet.
      • GODRIVER-2078: 1 week

      Summary

      Address unnecessary connection pool clears and connection churn on operations with a small context timeout.

      Motivation

      Who is the affected end user?

      Customers and users setting small context timeouts on operations.

      How does this affect the end user?

      The Go driver can enter a state where all connections in a connection pool have perished (usually because the pool was cleared after a timeout during new connection establishment) and almost no new connections can be established inline with an operation because the operation timeout is too low. Customers using the Go driver may encounter application outages because the driver cannot recover from this state.

      How likely is it that this problem or use case will occur?

      There are known customers experiencing this issue. It is likely to occur when users set a small context timeout on operations.

      If the problem does occur, what are the consequences and how severe are they?

      The problem can result in an application outage.

      Is this issue urgent?

      Yes, there are known customers that cannot migrate workloads to the Go driver due to this issue.

      Is this ticket required by a downstream team?

      No

      Is this ticket only for tests?

      No

      Cast of Characters

      Engineering Lead: Matt Dale
      Document Author: Matt Dale
      POCers:
      Product Owner:
      Program Manager:
      Stakeholders:

      Channels & Docs

      Technical Design Document

            Assignee:
            matt.dale@mongodb.com Matt Dale
            Reporter:
            kevin.albertson@mongodb.com Kevin Albertson
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved:
              16 weeks, 4 days