DRIVERS-2884

CSOT: avoid connection churn when operations time out

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Unknown
    • Component/s: CSOT
    • Needed

      Summary of necessary driver changes

      •  

      Commits for syncing spec/prose tests
      (and/or refer to an existing language POC if needed)

      •  

      Context for other referenced/linked tickets

      •  
      Key            Status/Resolution   FixVersion
      CDRIVER-5526   Blocked
      CXX-2998       Blocked
      CSHARP-5024    Blocked
      GODRIVER-3173  Blocked
      JAVA-5399      Blocked
      NODE-6062      Blocked
      MOTOR-1291     Duplicate
      PYTHON-4324    Blocked
      PHPLIB-1425    Blocked
      RUBY-3432      Blocked
      RUST-1903      Blocked

      Summary

      With CSOT, drivers set the socket read timeout to "remaining timeout" and set maxTimeMS to "remaining timeout - minRTT" so that the server has some time to respond with a MaxTimeMSExpired error. In practice, however, network and server latency vary, so the client can hit the socket timeout before it reads the MaxTimeMSExpired error. This is not ideal because drivers close the connection after hitting a socket timeout, which leads to connection churn.
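
      To make the arithmetic concrete, here is a minimal sketch of how the two values could be derived for one attempt. The function name `csot_timeouts`, the use of integer milliseconds, and the choice to raise when nothing is left after subtracting minRTT are assumptions of this sketch, not taken from any driver.

```python
def csot_timeouts(remaining_ms: int, min_rtt_ms: int) -> tuple[int, int]:
    """Return (socket_read_timeout_ms, max_time_ms) for one attempt.

    Hypothetical helper: names and error handling are illustrative only.
    """
    # maxTimeMS is "remaining timeout - minRTT", leaving the server
    # roughly one round trip to deliver a MaxTimeMSExpired error.
    max_time_ms = remaining_ms - min_rtt_ms
    if max_time_ms <= 0:
        # Assumption of this sketch: with no budget left for the server,
        # fail on the client instead of sending the command.
        raise TimeoutError("remaining timeout does not exceed minRTT")
    # The socket read timeout is the full remaining timeout.
    return remaining_ms, max_time_ms
```

      The gap between the two values is only minRTT, so any latency above the observed minimum can make the socket timeout fire first.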

      It would be better to avoid connection churn when operations time out. One way to accomplish this is to:

      1. keep the connection open after a socket timeout
      2. mark the connection with a pending read
      3. check the connection back into the pool

      Then some subsequent operation will:

      1. check out the connection from the pool
      2. check whether the connection has a pending read and, if so, complete the read
      3. continue running the next operation normally

      This design avoids the connection churn. It also implicitly enforces some back pressure as the next operation won't be sent to the server until the pending operation completes.
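
      The check-in and check-out steps above can be sketched with a small in-memory simulation. This is an illustration only, not the PyMongo proof of concept: `FakeSocket`, `Connection`, `Pool`, and `run_operation` are hypothetical names, and the fake socket raises `TimeoutError` whenever no reply has been delivered yet.

```python
from collections import deque


class FakeSocket:
    """Hypothetical stand-in for a socket; replies are readable once delivered."""

    def __init__(self):
        self.sent = []
        self.inbox = deque()

    def send(self, command):
        self.sent.append(command)

    def deliver(self, reply):
        # Simulates a server reply arriving on the wire.
        self.inbox.append(reply)

    def recv(self):
        if not self.inbox:
            raise TimeoutError("socket read timed out")
        return self.inbox.popleft()


class Connection:
    def __init__(self, sock):
        self.sock = sock
        self.pending_read = False  # True while a reply is still unread


class Pool:
    def __init__(self, connections):
        self._idle = deque(connections)

    def check_out(self):
        conn = self._idle.popleft()
        if conn.pending_read:
            try:
                conn.sock.recv()  # complete and discard the stale reply
            except TimeoutError:
                self._idle.append(conn)  # still pending; keep it pooled
                raise
            conn.pending_read = False
        return conn

    def check_in(self, conn):
        self._idle.append(conn)


def run_operation(pool, command):
    conn = pool.check_out()
    try:
        conn.sock.send(command)
        return conn.sock.recv()
    except TimeoutError:
        # Keep the connection open and mark the pending read.
        conn.pending_read = True
        raise
    finally:
        pool.check_in(conn)
```

      In this sketch the next operation on the connection is not sent until the stale reply has been read, which is where the back pressure comes from. A real driver would presumably also need to bound how long it waits on a pending read and decide when to give up and close the connection; the sketch leaves that open.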

      A proof of concept in PyMongo is implemented here: https://github.com/ShaneHarvey/mongo-python-driver/commit/d6e4c877d1972a1cba85673eb91a0c1dfcd185a9

      Motivation

      An example of this poor behavior is described in HELP-56519, where a latency spike on the server resulted in many socket timeouts. The resulting churn triggered a connection storm, which put even more load on the server.

      Who is the affected end user?

      All users of CSOT.

      How does this affect the end user?

      Connection churn and poor performance, because connections need to be reopened.

      How likely is it that this problem or use case will occur?

      Likely.

      If the problem does occur, what are the consequences and how severe are they?

      The resulting connection storm can overwhelm nodes in the cluster.

      Is this issue urgent?

      TBD.

      Is this ticket required by a downstream team?

      TBD.

      Is this ticket only for tests?

      Both spec changes and tests.

      Acceptance Criteria

            Assignee:
            Preston Vasquez (preston.vasquez@mongodb.com)
            Reporter:
            Shane Harvey (shane.harvey@mongodb.com)
            Shane Harvey
            KeAna Moutra