DRIVERS-2347

Prevent conflating operation timeout with connection establishment timeout

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Unknown
    • None
    • Component/s: CMAP, CSOT
    • Labels:
      None
      Key            Status/Resolution  FixVersion
      CDRIVER-4681   Blocked
      CXX-2712       Blocked
      CSHARP-4715    Blocked
      GODRIVER-2903  Blocked
      JAVA-5069      Blocked
      NODE-5451      Blocked
      MOTOR-1151     Blocked
      PYTHON-3834    Blocked
      PHPLIB-1193    Blocked
      RUBY-3295      Blocked
      RUST-1700      Blocked

      Summary

      The process for checking out and establishing connections described in the CMAP spec, combined with the timeout behavior described in the Server Selection section of the CSOT spec, causes issues when users specify low operation timeouts.

      Specifically, the CMAP spec states that if no connections are available, a connection pool should establish a new connection in-line with the check-out. The CSOT spec states that the timeout used to create a TCP/TLS connection is min(connectTimeoutMS, min(serverSelectionTimeoutMS, remaining timeoutMS)) and that the timeout used to handshake with the MongoDB server should be min(operationTimeout, remaining computedServerSelectionTimeout). As a result, if an operation times out, any in-progress connection establishment necessarily times out as well. If most operations have low timeouts (e.g. 1-5 seconds), the driver may not have enough time to establish any new connections, leading to a state where the driver cannot create any connections.
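
      To make the arithmetic concrete, here is a minimal Go sketch of the TCP/TLS timeout formula quoted above. The function names and the example values are hypothetical and chosen only for illustration. The sketch treats every value as a plain millisecond count and ignores any special handling a driver may apply to unset options.

```go
package main

import "fmt"

// minMS returns the smaller of two millisecond values.
func minMS(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

// socketConnectTimeoutMS mirrors the formula quoted from the CSOT spec:
// min(connectTimeoutMS, min(serverSelectionTimeoutMS, remaining timeoutMS)).
// The function name is an assumption made for this example.
func socketConnectTimeoutMS(connectTimeoutMS, serverSelectionTimeoutMS, remainingTimeoutMS int64) int64 {
	return minMS(connectTimeoutMS, minMS(serverSelectionTimeoutMS, remainingTimeoutMS))
}

func main() {
	// Illustrative values: a 10 s connect timeout, a 30 s server selection
	// timeout, and an operation with 2 s left on its timeout.
	fmt.Println(socketConnectTimeoutMS(10000, 30000, 2000)) // prints 2000
}
```

      With these inputs the operation's remaining 2 seconds, not connectTimeoutMS, bounds connection establishment.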

      Consider the case discovered in the Go driver, which supports client-side operation timeouts via the Go context.Context type:

      The current Go driver connection pool creates most new connections in-line with operations when there are no available idle connections. As a result, the Go driver times out new connection creation on either connectTimeoutMS or the operation context.Context timeout, whichever is shorter. The conflation of operation timeout with new connection creation timeout makes using operation timeouts in the Go driver complicated. The problem is exacerbated when the majority of operations use low operation timeouts (e.g. 1-5 second timeouts), when the operation frequency is high and bursty (e.g. 1,000+ op/sec), and when the time to establish a new connection is high (e.g. when TLS and auth handshake are enabled).

      The result is that the current Go driver can enter a state where all connections in a connection pool have perished (usually caused by a connection pool clear due to encountering a timeout during new connection establishment) and almost no new connections can be established in-line with an operation due to the low operation timeout. Customers using the Go driver may encounter application outages as a result of the Go driver’s inability to recover from this state.

      Note that the above was relevant for Go Driver v1.7.x and earlier. See GODRIVER-2038 for more context.
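
      The stuck state described for the Go driver can be reproduced in miniature. The following sketch is not driver code: establish and countEstablished are hypothetical helpers, the handshake is simulated with a timer, and the durations are scaled down to milliseconds so the example runs quickly. It only illustrates that when the operation's context also bounds the handshake, a handshake slower than the operation timeout never completes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// establish simulates a slow TCP/TLS/auth handshake. It needs handshakeTime
// to finish and gives up as soon as ctx is done.
func establish(ctx context.Context, handshakeTime time.Duration) error {
	timer := time.NewTimer(handshakeTime)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// countEstablished runs ops operations one after another. Each operation has
// its own timeout and establishes a connection in-line, so the operation
// deadline also bounds the handshake. It returns how many handshakes completed.
func countEstablished(ops, opTimeoutMS, handshakeMS int) int {
	established := 0
	for i := 0; i < ops; i++ {
		ctx, cancel := context.WithTimeout(context.Background(),
			time.Duration(opTimeoutMS)*time.Millisecond)
		if establish(ctx, time.Duration(handshakeMS)*time.Millisecond) == nil {
			established++
		}
		cancel()
	}
	return established
}

func main() {
	// Handshake (200 ms) is slower than the operation timeout (5 ms):
	// no attempt is expected to finish, however many operations run.
	fmt.Println("low operation timeout:", countEstablished(5, 5, 200))
	// Same pattern with a generous operation timeout: attempts can finish.
	fmt.Println("high operation timeout:", countEstablished(5, 1000, 20))
}
```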

      Any driver that implements a connection pool as described in the current CMAP spec and implements client-side operation timeout as described in the current CSOT spec will potentially encounter the same issue discovered in the Go driver. To prevent that issue, drivers must never let operation timeout influence connection establishment timeout. Honoring both operation timeout and connection establishment timeout requires running connection check-out and connection establishment in different threads. Drivers should continue connection establishment for connectTimeoutMS, even if the check-out that requested the new connection times out.
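
      One way to picture the separation described above is the toy pool below. It is a sketch under stated assumptions, not the Go driver's implementation: the pool and conn types, the simulated handshake, and the scaled-down durations are all hypothetical. It also omits details a real pool needs, such as limiting concurrent establishment and queuing waiters. Check-out waits on the operation's context, while establishment runs in its own goroutine bounded only by connectTimeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// conn stands in for an established connection (hypothetical type).
type conn struct{}

// pool is a toy connection pool; idle holds established connections.
type pool struct {
	idle           chan *conn
	connectTimeout time.Duration
	handshakeTime  time.Duration // simulated cost of TCP/TLS/auth
}

// establish runs in its own goroutine and is bounded only by connectTimeout,
// never by the context of the operation that requested the connection.
func (p *pool) establish() {
	ctx, cancel := context.WithTimeout(context.Background(), p.connectTimeout)
	defer cancel()
	timer := time.NewTimer(p.handshakeTime)
	defer timer.Stop()
	select {
	case <-timer.C:
		select {
		case p.idle <- &conn{}: // hand the connection to the pool
		default: // pool is full; drop the connection
		}
	case <-ctx.Done(): // connectTimeout expired
	}
}

// checkOut returns an idle connection if one exists. Otherwise it requests a
// new connection in the background and waits until either a connection
// becomes available or the operation's context is done.
func (p *pool) checkOut(ctx context.Context) (*conn, error) {
	select {
	case c := <-p.idle:
		return c, nil
	default:
	}
	go p.establish()
	select {
	case c := <-p.idle:
		return c, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// demo performs two check-outs with a 5 ms operation timeout against a pool
// whose handshake takes 50 ms.
func demo() (firstTimedOut, secondSucceeded bool) {
	p := &pool{idle: make(chan *conn, 16), connectTimeout: time.Second,
		handshakeTime: 50 * time.Millisecond}

	ctx1, cancel1 := context.WithTimeout(context.Background(), 5*time.Millisecond)
	_, err := p.checkOut(ctx1)
	cancel1()
	firstTimedOut = errors.Is(err, context.DeadlineExceeded)

	time.Sleep(300 * time.Millisecond) // background establishment finishes

	ctx2, cancel2 := context.WithTimeout(context.Background(), 5*time.Millisecond)
	c, err := p.checkOut(ctx2)
	cancel2()
	secondSucceeded = err == nil && c != nil
	return
}

func main() {
	first, second := demo()
	fmt.Println("first check-out timed out:", first)
	fmt.Println("second check-out succeeded:", second)
}
```

      In this sketch the first check-out still fails, but the connection it requested survives the operation timeout and serves the next check-out, so the pool can recover.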

      Update the CMAP spec to describe the necessary separation of threads of execution between connection check-out and connection establishment. Update the CSOT spec to describe that connection establishment should always use a timeout of connectTimeoutMS, independent of operation timeout. Note that implementing CSOT likely requires drivers to refactor their connection pool implementations.

      Note that an alternative to separate threads is to always continue establishing connections for connectTimeoutMS, even if the operation timeout has expired.
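
      The alternative mentioned above could look roughly like the following. This is one possible reading, sketched with hypothetical names (inlinePool, conn) and a simulated handshake. Establishment stays on the calling goroutine but is bounded only by connectTimeout. If the operation's context has expired by the time the handshake finishes, the connection is parked in the pool and the check-out reports the timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// conn stands in for an established connection (hypothetical type).
type conn struct{}

// inlinePool is a toy pool that establishes connections on the caller's goroutine.
type inlinePool struct {
	idle           chan *conn
	connectTimeout time.Duration
	handshakeTime  time.Duration // simulated cost of TCP/TLS/auth
}

// establish is bounded only by connectTimeout, not by any operation context.
func (p *inlinePool) establish() (*conn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), p.connectTimeout)
	defer cancel()
	timer := time.NewTimer(p.handshakeTime)
	defer timer.Stop()
	select {
	case <-timer.C:
		return &conn{}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// checkOut establishes in-line. The operation context is consulted only after
// the handshake has finished or connectTimeout has expired.
func (p *inlinePool) checkOut(ctx context.Context) (*conn, error) {
	select {
	case c := <-p.idle:
		return c, nil
	default:
	}
	c, err := p.establish()
	if err != nil {
		return nil, err
	}
	if ctx.Err() != nil {
		// The operation expired during the handshake: keep the connection
		// for a later check-out instead of discarding it.
		select {
		case p.idle <- c:
		default:
		}
		return nil, ctx.Err()
	}
	return c, nil
}

// demoInline performs two check-outs with a 5 ms operation timeout against a
// pool whose handshake takes 50 ms.
func demoInline() (firstTimedOut, secondSucceeded bool) {
	p := &inlinePool{idle: make(chan *conn, 16), connectTimeout: time.Second,
		handshakeTime: 50 * time.Millisecond}

	ctx1, cancel1 := context.WithTimeout(context.Background(), 5*time.Millisecond)
	_, err := p.checkOut(ctx1)
	cancel1()
	firstTimedOut = errors.Is(err, context.DeadlineExceeded)

	ctx2, cancel2 := context.WithTimeout(context.Background(), 5*time.Millisecond)
	c, err := p.checkOut(ctx2)
	cancel2()
	secondSucceeded = err == nil && c != nil
	return
}

func main() {
	first, second := demoInline()
	fmt.Println("first check-out timed out:", first)
	fmt.Println("second check-out succeeded:", second)
}
```

      A consequence of this particular sketch is that the first caller does not return until the handshake finishes, which can be later than its own deadline. The separate-thread approach avoids that delay.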

      Consider the list of drivers that establish connections in the checkOut function

      Motivation

      Who is the affected end user?

      Users using drivers that support client-side operation timeouts, especially users who set low timeouts (1-5 seconds) and run services with high operation volumes (1,000+ op/sec).

      How does this affect the end user?

      The driver may enter a state where it cannot create any new connections for a long period of time. The user's services may experience extended outages if that happens.

      How likely is it that this problem or use case will occur?

      Fairly likely for users who set low timeouts (1-5 seconds) and run services with high operation volumes (1,000+ op/sec).

      If the problem does occur, what are the consequences and how severe are they?

      The user's services may experience extended periods where the driver cannot establish connections and cannot do any work, either at startup or intermittently during the operation of the service.

      Is this issue urgent?

      Must be completed before DRIVERS-555 can be implemented in most drivers.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      No.

            Assignee:
            Shane Harvey (shane.harvey@mongodb.com)
            Reporter:
            Matt Dale (matt.dale@mongodb.com)
            Steve Silvester
            Esha Bhargava
            Votes:
            0
            Watchers:
            9