Go Driver / GODRIVER-2024

Connection pool: long semaphore wait causes connections to be closed

    • Type: Improvement
    • Resolution: Done
    • Priority: Critical - P2
    • None
    • Affects Version/s: 1.5.2
    • Component/s: Connections
    • Labels: None
    • Needed

      The proposed solution makes two changes, both of which need additional documentation:
      1. Metrics for connection pool checkout duration (semaphore wait time)
      2. A new configuration option for either a minimum connection I/O duration or a maximum connection checkout (semaphore wait) duration.


      Problem

      We found that the driver unnecessarily closes connections and clears the connection pool under high load.

      This occurs when the semaphore wait time to acquire a connection approaches the context timeout. If a connection is acquired with little or no time left before the context deadline, the connection is closed, because any use of it results in a timeout. After the connection is closed, other goroutines attempt to open a new connection with a similarly short remaining deadline. When the new connection fails to be established (the handshake times out), the entire pool is cleared (its generation is incremented). This vicious cycle repeats, increasing both the error rate and cluster CPU usage (spent creating the new connections).

      Proposed Solution

      1. Publish metrics for connection pool checkout duration (semaphore wait time).
      2. Prevent closing connections when the remaining deadline is below a threshold. This can be accomplished in one of two ways:
        1. Add a client option for a minimum connection I/O duration. After a connection is acquired, if the context has a deadline and the remaining duration is below the minimum connection I/O duration, fail fast before attempting to use the connection.
        2. Add a client option for a maximum connection pool checkout duration (semaphore wait duration). If the context has a deadline and the time remaining until that deadline is greater than the maximum checkout duration, call acquire with a new context whose timeout equals the maximum semaphore wait duration.

      Example error pattern:

      time="2021-05-24T14:29:24-07:00" level=info msg=mongo_pool_event activity=true connection_id=0 reason=timeout type=ConnectionCheckOutFailedSemaphore
      time="2021-05-24T14:29:24-07:00" level=info msg=mongo_pool_event activity=true connection_id=0 reason="ProcessHandshakeError: connection() error occured during connection handshake: context deadline exceeded" type=ConnectionPoolCleared
      

      One way to reproduce this problem locally is to run a script with high concurrency, a low operation timeout, and a small maximum connection pool size.

            Assignee:
            matt.dale@mongodb.com Matt Dale
            Reporter:
            akahn@tesla.com Aaron Kahn
            Votes:
            0
            Watchers:
            6
