[DRIVERS-2035] Use minimum RTT for CSOT maxTimeMS calculation instead of 90th percentile Created: 20/Jan/22  Updated: 16/Jan/24

Status: Implementing
Project: Drivers
Component/s: CSOT
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Shane Harvey Assignee: Shane Harvey
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to NODE-5825 Add minRoundTripTime field and calcul... Scheduled
split to GODRIVER-2762 Use minimum RTT for CSOT maxTimeMS ca... Closed
split to PYTHON-3616 Use minimum RTT for CSOT maxTimeMS ca... Closed
Related
is related to NODE-3078 Client Side Operations Timeout Implementing
Epic Link: DRIVERS-555
Driver Changes: Not Needed
Quarter: FY23Q2, FY23Q3, FY24Q2
Downstream Changes Summary:

Drivers must use the minimum RTT for the CSOT maxTimeMS calculation instead of the 90th percentile. At least 2 RTT samples are required; otherwise drivers must use 0 as the RTT. Drivers must keep at most the last 10 samples. These changes were made to avoid preemptively failing operations because of inaccurate or unstable RTT measurements.
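
The sampling rules in this summary can be sketched as follows. This is a minimal illustration, not code from any driver: the class name `MinRTTTracker`, its method names, and the use of seconds as the unit are all assumptions made for the example.

```python
from collections import deque


class MinRTTTracker:
    """Hypothetical helper illustrating the summary above; not a driver API."""

    MAX_SAMPLES = 10  # keep at most the last 10 samples
    MIN_SAMPLES = 2   # fewer samples than this means "use 0 as RTT"

    def __init__(self):
        # A bounded deque silently discards the oldest sample when full.
        self._samples = deque(maxlen=self.MAX_SAMPLES)

    def add_sample(self, rtt_seconds):
        self._samples.append(rtt_seconds)

    def min_rtt(self):
        # With fewer than 2 samples the measurement is not trusted yet.
        if len(self._samples) < self.MIN_SAMPLES:
            return 0.0
        return min(self._samples)
```

Because old samples fall out of the window, one unusually fast measurement stops influencing the result after ten newer samples have arrived.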

Spec change commit: https://github.com/mongodb/specifications/commit/c06650d86f7e47ea30cb2d992942bcec6ef155f9
Spec change PR: https://github.com/mongodb/specifications/pull/1350

Start date:
Driver Compliance:
Key            Status/Resolution  FixVersion
PYTHON-3616    Fixed              4.4
GODRIVER-2762  Fixed              2.0.0
NODE-5825      Scheduled

 Description   

In the PR review for the timeout spec, matt.dale made a suggestion that was never resolved. To quote:

Using the 90th percentile RTT latency will result in some operations that are likely to complete being cancelled instead.

Let's consider a Find operation that completes quickly on the server (i.e. <1ms) running on an Atlas cluster, so almost all of the latency is from the network round trip. There are 3 buckets of timing conditions the driver will encounter:

  1. The client-side deadline is greater than (now + max observed RTT); the operation will almost certainly complete before the deadline.
  2. The client-side deadline is between [(now + min observed RTT), (now + max observed RTT)]; the operation may complete or may fail due to timeout.
  3. The client-side deadline is less than (now + min observed RTT); the operation will almost certainly fail due to timeout.
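
The three buckets above can be written as a small classifier. This is only a sketch: the function name `classify_deadline` and its parameters are hypothetical, and all times are assumed to be in the same unit (for example, seconds on a monotonic clock).

```python
def classify_deadline(deadline, now, min_rtt, max_rtt):
    """Return the bucket number (1, 2, or 3) for a client-side deadline.

    Hypothetical helper for illustration; not part of any driver.
    """
    if deadline > now + max_rtt:
        return 1  # will almost certainly complete before the deadline
    if deadline < now + min_rtt:
        return 3  # will almost certainly fail due to timeout
    return 2      # may complete or may time out
```

For example, with `now = 0`, `min_rtt = 1`, and `max_rtt = 5`, a deadline of 3 falls in bucket 2, which is the case the rest of the discussion is about.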

The operations we're interested in are in bucket 2. By assuming the network round trip will take the 90th percentile observed RTT, we may cancel operations that have a nearly 90% chance of completing before the deadline. Cancelling operations is dangerous because we're actually preventing the driver from doing work. We should instead bias toward cancelling as few operations that have a reasonable chance of completing as possible, in exchange for also letting more operations time out.

I propose that we change the cancellation threshold to the 5-minute minimum RTT (i.e. minimum RTT observed in the last 5 minutes) instead of the 90th percentile. While the 10th or 25th percentile more closely match the "reasonable chance of succeeding" threshold, the added complexity of using the t-digest algorithm doesn't seem to justify the small optimization.
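
A time-windowed minimum like the one proposed here needs no percentile estimation at all. The sketch below is one possible shape, under the assumption that each sample is stored with its timestamp and that the caller supplies the current time; the class `WindowedMinRTT` and its methods are hypothetical names. Note that the downstream summary in this ticket describes a sample-count bound (the last 10 samples) rather than a time window.

```python
from collections import deque


class WindowedMinRTT:
    """Hypothetical sketch of a minimum RTT over a sliding time window."""

    def __init__(self, window_seconds=300.0):  # 300 s = 5 minutes
        self._window = window_seconds
        self._samples = deque()  # (timestamp, rtt) pairs, oldest first

    def _expire(self, now):
        # Drop samples that are older than the window.
        while self._samples and now - self._samples[0][0] > self._window:
            self._samples.popleft()

    def add_sample(self, rtt, now):
        self._samples.append((now, rtt))
        self._expire(now)

    def min_rtt(self, now):
        self._expire(now)
        if not self._samples:
            return None  # no recent measurements
        return min(rtt for _, rtt in self._samples)
```

Taking the minimum is a single pass over the retained samples, which is the simplicity argument made above against a t-digest based percentile.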

We should reconsider the 90th percentile RTT heuristic used for deciding whether to send an operation and for setting maxTimeMS.
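
To make the effect of the RTT estimate concrete, here is a minimal sketch of the calculation. It assumes that maxTimeMS is the remaining timeout minus the RTT estimate and that the operation is not sent when nothing is left; the function `compute_max_time_ms` and the exception `OperationTimeoutError` are hypothetical names for this example, not a driver API.

```python
class OperationTimeoutError(Exception):
    """Hypothetical error raised instead of sending the operation."""


def compute_max_time_ms(remaining_timeout_ms, rtt_estimate_ms):
    # Assumption: the server gets whatever time is left after one round trip.
    max_time_ms = remaining_timeout_ms - rtt_estimate_ms
    if max_time_ms <= 0:
        # Not enough time left according to the estimate: fail early.
        raise OperationTimeoutError("remaining timeout is below the RTT estimate")
    return max_time_ms
```

With 100 ms remaining, an estimate of 120 ms (a high-percentile RTT) cancels the operation before it is sent, while an estimate of 20 ms (the minimum RTT) sends it with a maxTimeMS of 80.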



 Comments   
Comment by Githook User [ 16/Feb/23 ]

Author: Shane Harvey (email: shnhrv@gmail.com, username: ShaneHarvey)

Message: DRIVERS-2035 Use minimum RTT for CSOT maxTimeMS calculation instead of 90th percentile (#1350)

Require at least 2 RTT samples, otherwise use 0 as RTT. Only keep last 10 samples.
Update tests to wait for multiple RTTs.
Branch: master
https://github.com/mongodb/specifications/commit/c06650d86f7e47ea30cb2d992942bcec6ef155f9
