DRIVERS-3112

CSOT minRoundTripTime calculation double counts server processing time

    • Type: Spec Change
    • Resolution: Unresolved
    • Priority: Unknown
    • Component/s: CSOT

      Summary

      The CSOT spec says:

      When constructing a command, drivers use the timeoutMS option to derive a value for the maxTimeMS command option and the socket timeout. The full time to round trip a command is (network RTT + server-side execution time). If both maxTimeMS and socket timeout were set to the same value, the server would never be able to respond with a MaxTimeMSExpired error because drivers would hit the socket timeout first and close the connection. This would lead to connection churn if the specified timeout is too low. To allow the server to gracefully error and avoid churn, drivers must account for the network round trip in the maxTimeMS calculation.

      However, the client's RTT measurement includes not just the network round trip time but also the server processing time for the hello/ping command. Normally, the processing time for a hello/ping command is very small (under a millisecond), but if the server becomes overloaded that time can increase substantially. As a result, server processing time is effectively double counted against the command: once inside the RTT that is subtracted to compute maxTimeMS, and again as the command's own execution time. An example makes this clearer. Suppose the driver is constructing the command and applying maxTimeMS:

      timeoutMS=1001
      remainingTimeoutMS=1000
      minRttMS=10   # normal operations, rtt is all network latency
      trueNetworkLatency=9
      maxTimeMS=990
      

      This scenario is fine since minRttMS is roughly equal to network latency. Next, let's assume the server is overloaded and minRttMS increases from 10ms to 600ms:

      timeoutMS=1001
      remainingTimeoutMS=1000
      minRttMS=600   # server overloaded, rtt is mostly server latency
      trueNetworkLatency=9  # same as before.
      maxTimeMS=400
      

      The command now has only 400ms to execute on the server when it should be given ~990ms. This is counterproductive: when the server is running slowly, operations are given less time to execute.
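      The two scenarios above can be reproduced with a minimal sketch. It assumes maxTimeMS is derived as (remaining timeout - minRttMS) and that the measured RTT is simply network latency plus hello/ping processing time. The names Scenario, hello_processing_ms and derive_max_time_ms are hypothetical and not taken from any driver. The 1ms and 591ms processing times are inferred from the example numbers (10 - 9 and 600 - 9), and raising an error on a non-positive result is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    remaining_timeout_ms: int
    network_latency_ms: int   # true network round trip
    hello_processing_ms: int  # server time spent on the hello/ping itself

    @property
    def min_rtt_ms(self) -> int:
        # What the client measures: it cannot separate the two components.
        return self.network_latency_ms + self.hello_processing_ms


def derive_max_time_ms(remaining_timeout_ms: int, rtt_ms: int) -> int:
    """Subtract an RTT estimate from the remaining timeout (hypothetical helper)."""
    max_time_ms = remaining_timeout_ms - rtt_ms
    if max_time_ms <= 0:
        # Assumption: fail fast instead of sending a non-positive maxTimeMS.
        raise TimeoutError("remaining timeout does not exceed the RTT estimate")
    return max_time_ms


normal = Scenario(remaining_timeout_ms=1000, network_latency_ms=9, hello_processing_ms=1)
overloaded = Scenario(remaining_timeout_ms=1000, network_latency_ms=9, hello_processing_ms=591)

for name, s in (("normal", normal), ("overloaded", overloaded)):
    # maxTimeMS as derived today, from the measured RTT.
    current = derive_max_time_ms(s.remaining_timeout_ms, s.min_rtt_ms)
    # maxTimeMS if only the true network latency were subtracted.
    network_only = derive_max_time_ms(s.remaining_timeout_ms, s.network_latency_ms)
    print(f"{name}: maxTimeMS={current}, network-only maxTimeMS={network_only}")
```

      In this model the network-only value stays at 991ms in both scenarios, while the value derived from the measured RTT drops from 990ms to 400ms once the server is overloaded.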

      Motivation

      Who is the affected end user?

      Any user of CSOT.

      How does this affect the end user?

      It causes more operations to fail with MaxTimeMSExpired errors because maxTimeMS is artificially lower than it should be.

      How likely is it that this problem or use case will occur?

      When the server is overloaded and CSOT is in use.

      If the problem does occur, what are the consequences and how severe are they?

      Performance concerns.

      Is this issue urgent?

      TBD.

      Is this ticket required by a downstream team?

      No.

      Acceptance Criteria

      TBD.

      Thanks to jeff.yemin@mongodb.com for discovering this problem.

            Assignee: Unassigned
            Reporter: Shane Harvey (shane.harvey@mongodb.com)
            Votes: 0
            Watchers: 4
