[DRIVERS-2552] Remove average and 90th percentile RTT times from server description Created: 15/Feb/23 Updated: 16/Jun/23 Resolved: 16/Jun/23 |
|
| Status: | Closed |
| Project: | Drivers |
| Component/s: | SDAM |
| Fix Version/s: | None |
| Type: | Spec Change | Priority: | Unknown |
| Reporter: | Matt Dale | Assignee: | Shane Harvey |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Driver Changes: | Needed | ||||||||
| Description |
SummaryMost fields on a server description as described in the ServerDescription section of the SDAM spec come directly from the most recent connection handshake. However roundTripTime and ninetiethPercentileRoundTripTime don't come from the handshake information. The Measuring RTT section of the SDAM spec describes that drivers must set the roundTripTime and ninetiethPercentileRoundTripTime from the current average and 90th percentile RTTs for that server, respectively, as measured by the RTT monitor:
The problem with that approach is that setting an arbitrary snapshot of a constantly-updated value can lead to confusion and bugs when developers need up-to-date RTT data. Since the SDAM spec doesn't describe whether the roundTripTime and ninetiethPercentileRoundTripTime values on a server description should or must be consistently updated, driver devs who are unfamiliar with the sources of RTT data likely have no idea whether that RTT data on the server description is up-to-date or not. That may lead driver devs on a potentially time-consuming investigation into the source of and all references to that data, possibly leading to the conclusion that the data is stale and not useful. Worse, a driver dev may come to the wrong conclusion and implement a feature that uses stale RTT information, potentially leading to difficult-to-diagnose bugs. Considering many drivers do not keep the RTT data on server descriptions up-to-date, and stale RTT data is more confusing than useful, we should remove the roundTripTime and ninetiethPercentileRoundTripTime fields from the server description in the SDAM spec. Instead, all drivers should fetch up-to-date RTT information from the RTT monitor as described in the RTT thread section of SDAM. MotivationWho is the affected end user?Drivers developers. How does this affect the end user?They are confused by stale RTT data on server descriptions. How likely is it that this problem or use case will occur?Devs who are new to developing features that depend on RTT data likely do not understand if/when different server description fields are updated. The SDAM specification isn't clear about if/when the roundTripTime/ninetiethPercentileRoundTripTime fields on server descriptions should be updated, so the actual behavior may be inconsistent between drivers. If the problem does occur, what are the consequences and how severe are they?Devs can spend significant time tracing code and troubleshooting SDAM behaviors trying to determine if/when the roundTripTime/ninetiethPercentileRoundTripTime fields on a server description are updated. This can be wasted time if the dev realizes the RTT values on the server description are not consistently updated because they need up-to-date RTT data. Worse, if the dev comes to the wrong conclusion and uses stale RTT values from a server description, they may implement a feature with difficult-to-diagnose bugs. Is this issue urgent?No. Is this ticket required by a downstream team?No. Is this ticket only for tests?No. |
| Comments |
| Comment by Shane Harvey [ 16/Jun/23 ] |
|
Thanks Matt! Closing for the reasons above. An additional reason is that ServerDescription is a public api and we don't want to make breaking changes. |
| Comment by Shane Harvey [ 05/Apr/23 ] |
|
I believe that is a feature to help accomplish this requirement:
We want to avoid the RTT and the server selection info from becoming out of sync. Putting the RTT on the ServerDescription helps accomplish that. It's also better for observability since users can log the RTT info from the ServerDescription. |
| Comment by Matt Dale [ 01/Apr/23 ] |
|
shane.harvey@mongodb.com that's a good point. However, "stale" in this case refers to staleness relative to the best source of information. For example, the best source of information for server state in the driver is the most recent connection handshake or heartbeat response. The best source of information for round-trip times is the RTT monitor. Storing snapshots of RTT monitor values on a server description can result in stale data relative to the data available from the RTT monitor. That is, if you need round-trip time data, you should always get it from the RTT monitor, so there's no reason to store it on a server description. |
| Comment by Shane Harvey [ 27/Mar/23 ] |
|
I don't agree with this ticket's rationale because even fields that are reported by the server in the hello command response can become stale. For example, hosts, isWritablePrimary, $clusterTime, operationTime, etc.. In fact, almost every single field can become stale right after the server sends it. I'd recommend closing this ticket and |