[DRIVERS-2552] Remove average and 90th percentile RTT times from server description Created: 15/Feb/23  Updated: 16/Jun/23  Resolved: 16/Jun/23

Status: Closed
Project: Drivers
Component/s: SDAM
Fix Version/s: None

Type: Spec Change Priority: Unknown
Reporter: Matt Dale Assignee: Shane Harvey
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to GODRIVER-2691 Deprecate "description.Server" fields... Closed
Driver Changes: Needed

 Description   

Summary

Most fields on a server description as described in the ServerDescription section of the SDAM spec come directly from the most recent connection handshake. However roundTripTime and ninetiethPercentileRoundTripTime don't come from the handshake information.

The Measuring RTT section of the SDAM spec describes that drivers must set the roundTripTime and ninetiethPercentileRoundTripTime from the current average and 90th percentile RTTs for that server, respectively, as measured by the RTT monitor:

When constructing a ServerDescription from a streaming hello or legacy hello response, clients MUST use average and 90th percentile round trip times from the RTT task.

The problem with that approach is that setting an arbitrary snapshot of a constantly-updated value can lead to confusion and bugs when developers need up-to-date RTT data. Since the SDAM spec doesn't describe whether the roundTripTime and ninetiethPercentileRoundTripTime values on a server description should or must be consistently updated, driver devs who are unfamiliar with the sources of RTT data likely have no idea whether that RTT data on the server description is up-to-date or not. That may lead driver devs on a potentially time-consuming investigation into the source of and all references to that data, possibly leading to the conclusion that the data is stale and not useful. Worse, a driver dev may come to the wrong conclusion and implement a feature that uses stale RTT information, potentially leading to difficult-to-diagnose bugs.

Considering many drivers do not keep the RTT data on server descriptions up-to-date, and stale RTT data is more confusing than useful, we should remove the roundTripTime and ninetiethPercentileRoundTripTime fields from the server description in the SDAM spec. Instead, all drivers should fetch up-to-date RTT information from the RTT monitor as described in the RTT thread section of SDAM.

Motivation

Who is the affected end user?

Drivers developers.

How does this affect the end user?

They are confused by stale RTT data on server descriptions.

How likely is it that this problem or use case will occur?

Devs who are new to developing features that depend on RTT data likely do not understand if/when different server description fields are updated. The SDAM specification isn't clear about if/when the roundTripTime/ninetiethPercentileRoundTripTime fields on server descriptions should be updated, so the actual behavior may be inconsistent between drivers.

If the problem does occur, what are the consequences and how severe are they?

Devs can spend significant time tracing code and troubleshooting SDAM behaviors trying to determine if/when the roundTripTime/ninetiethPercentileRoundTripTime fields on a server description are updated. This can be wasted time if the dev realizes the RTT values on the server description are not consistently updated because they need up-to-date RTT data. Worse, if the dev comes to the wrong conclusion and uses stale RTT values from a server description, they may implement a feature with difficult-to-diagnose bugs.

Is this issue urgent?

No.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No.



 Comments   
Comment by Shane Harvey [ 16/Jun/23 ]

Thanks Matt! Closing for the reasons above. An additional reason is that ServerDescription is a public api and we don't want to make breaking changes.

Comment by Shane Harvey [ 05/Apr/23 ]

I believe that is a feature to help accomplish this requirement:

If timeoutMS is set, drivers MUST append a maxTimeMS field to commands executed against a MongoDB server using the minRoundTripTime field of the selected server. Note that this value MUST be retrieved during server selection using the servers field of the same TopologyDescription that was used for selection before the selected server's description can be modified. Otherwise, drivers may be subject to a race condition where a server is reset to the default description (e.g. due to an error in the monitoring thread) after it has been selected but before the RTT is retrieved.

We want to avoid the RTT and the server selection info from becoming out of sync. Putting the RTT on the ServerDescription helps accomplish that. It's also better for observability since users can log the RTT info from the ServerDescription.

Comment by Matt Dale [ 01/Apr/23 ]

shane.harvey@mongodb.com that's a good point. However, "stale" in this case refers to staleness relative to the best source of information. For example, the best source of information for server state in the driver is the most recent connection handshake or heartbeat response. The best source of information for round-trip times is the RTT monitor. Storing snapshots of RTT monitor values on a server description can result in stale data relative to the data available from the RTT monitor. That is, if you need round-trip time data, you should always get it from the RTT monitor, so there's no reason to store it on a server description.

Comment by Shane Harvey [ 27/Mar/23 ]

I don't agree with this ticket's rationale because even fields that are reported by the server in the hello command response can become stale. For example, hosts, isWritablePrimary, $clusterTime, operationTime, etc.. In fact, almost every single field can become stale right after the server sends it. I'd recommend closing this ticket and GODRIVER-2691

Generated at Thu Feb 08 08:25:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.