Type: Spec Change
Resolution: Unresolved
Priority: Unknown
Component/s: CSOT
Summary
The CSOT spec says:
When constructing a command, drivers use the timeoutMS option to derive a value for the maxTimeMS command option and the socket timeout. The full time to round trip a command is (network RTT + server-side execution time). If both maxTimeMS and socket timeout were set to the same value, the server would never be able to respond with a MaxTimeMSExpired error because drivers would hit the socket timeout first and close the connection. This would lead to connection churn if the specified timeout is too low. To allow the server to gracefully error and avoid churn, drivers must account for the network round trip in the maxTimeMS calculation.
However, the client's RTT measurement includes not just the network round trip time but also the server processing time for the hello/ping command. Normally, the processing time for a hello/ping command is very small (under a millisecond), but if the server becomes overloaded, that time can increase substantially. As a result, server processing time is effectively double counted against the command: once inside the measured RTT and once as the command's own execution time. An example makes this clearer. Let's say we're at the point where the driver is constructing the command and applying maxTimeMS:
timeoutMS=1001
remainingTimeoutMS=1000
minRttMS=10 # normal operations, rtt is all network latency
trueNetworkLatency=9
maxTimeMS=990
This scenario is fine since minRttMS is roughly equal to network latency. Next, let's assume the server is overloaded and minRttMS increases from 10ms to 600ms:
timeoutMS=1001
remainingTimeoutMS=1000
minRttMS=600 # server overloaded, rtt is mostly server latency
trueNetworkLatency=9 # same as before
maxTimeMS=400
So now the command has only 400ms to execute on the server when it should be given ~990ms. This is backwards: the slower the server is running, the less time we give operations to execute.
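The arithmetic in the two scenarios above can be sketched in a few lines of Python. This is only an illustration, not any driver's actual implementation: the helper name `derive_max_time_ms` is hypothetical, it assumes maxTimeMS is derived as remainingTimeoutMS minus minRttMS (which matches the numbers above), and the error branch for a non-positive result is an assumption of this sketch.

```python
def derive_max_time_ms(remaining_timeout_ms: int, min_rtt_ms: int) -> int:
    """Hypothetical helper: reserve the measured RTT out of the remaining timeout."""
    max_time_ms = remaining_timeout_ms - min_rtt_ms
    if max_time_ms <= 0:
        # Assumption of this sketch: nothing is left for server-side execution.
        raise TimeoutError("remaining timeout does not exceed the measured RTT")
    return max_time_ms


REMAINING_TIMEOUT_MS = 1000
TRUE_NETWORK_LATENCY_MS = 9  # unchanged in both scenarios

# Normal operations: minRttMS is almost entirely network latency.
normal = derive_max_time_ms(REMAINING_TIMEOUT_MS, min_rtt_ms=10)

# Overloaded server: minRttMS is dominated by hello/ping processing time.
overloaded = derive_max_time_ms(REMAINING_TIMEOUT_MS, min_rtt_ms=600)

# What the command would get if only true network latency were subtracted.
ideal = derive_max_time_ms(REMAINING_TIMEOUT_MS, min_rtt_ms=TRUE_NETWORK_LATENCY_MS)

print(normal)              # 990
print(overloaded)          # 400
print(ideal - overloaded)  # 591 ms of execution time lost to double counting
```

The gap between `ideal` and `overloaded` is exactly the server processing time that leaked into the RTT measurement (600 - 9 = 591ms in this example).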
Motivation
Who is the affected end user?
Any user of CSOT.
How does this affect the end user?
It causes more operations to fail with MaxTimeMSExpired errors because maxTimeMS is artificially lower than it should be.
How likely is it that this problem or use case will occur?
When the server is overloaded and CSOT is in use.
If the problem does occur, what are the consequences and how severe are they?
Performance concerns.
Is this issue urgent?
TBD.
Is this ticket required by a downstream team?
No.
Acceptance Criteria
TBD.
Thanks to jeff.yemin@mongodb.com for discovering this problem.