[DRIVERS-2798] Gossiping the cluster time from monitoring connections can result in loss of availability Created: 18/Dec/23  Updated: 23/Jan/24

Status: Backlog
Project: Drivers
Component/s: Sessions
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Jeffrey Yemin Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: availability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to JAVA-5256 Switching replicas IP with a replica ... Closed
Driver Changes: Needed
Quarter: FY25Q2

Description

Summary

In unusual situations, gossiping the cluster time received on monitoring connections can result in a complete loss of availability that requires an application restart. The problem was traced to a temporary state during which the driver attempts to connect to a member of the wrong replica set running on the same pod. Since cluster times between deployments are not compatible, all operations fail until the application is restarted.

Motivation

Who is the affected end user?

We only have one report of this, in JAVA-5256.  Please see that ticket for details, as they are quite involved.

How does this affect the end user?

Availability is completely compromised and an application restart is required.

How likely is it that this problem or use case will occur?

It's certainly unusual, as we have not heard other reports of this from people using our Kubernetes operator. On the other hand, the fix is likely simple for most drivers, though testing is an issue (there are probably no tests of the existing behavior).

If the problem does occur, what are the consequences and how severe are they?

Complete loss of availability to the desired cluster.

Is this issue urgent?

The user has no simple workaround, but it is possible to work around the issue.

Is this ticket required by a downstream team?

No

Is this ticket only for tests?

No

Acceptance Criteria

The requirement is a clarification to the sessions specification stating that cluster time gossiping should be limited to pooled connections and should not include monitoring connections. It is unclear, though, how a test could be written. In a POC of this in the Java driver, it was achieved by a simple design change that made it impossible to gossip the cluster time from monitoring connections, but it is certainly possible that a future design change could reverse that and re-introduce the issue.
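
For illustration, here is a minimal sketch of what such a design change could look like. ClusterClock and its surroundings are hypothetical, not the Java driver's actual internals; the point is only that the shared clock is advanced exclusively from the operation (pooled connection) path and the monitoring code is never handed a reference to it.

{code:java}
// Hypothetical sketch: a shared clock that only the operation path can advance.
import org.bson.BsonDocument;
import org.bson.BsonTimestamp;

final class ClusterClock {
    private BsonDocument clusterTime; // latest $clusterTime document seen, or null

    // Called only with the $clusterTime field of responses received on pooled
    // connections; monitoring (heartbeat) code never receives this object.
    synchronized void advance(BsonDocument other) {
        if (other != null && (clusterTime == null
                || timestampOf(other).compareTo(timestampOf(clusterTime)) > 0)) {
            clusterTime = other;
        }
    }

    // Attached as $clusterTime to outgoing commands on pooled connections.
    synchronized BsonDocument current() {
        return clusterTime;
    }

    private static BsonTimestamp timestampOf(BsonDocument clusterTime) {
        return clusterTime.getTimestamp("clusterTime");
    }
}
{code}

A structural constraint like this is hard to violate accidentally, but as noted above it cannot guarantee that a later refactoring won't hand the clock back to the monitoring path.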

Additional Notes

Gossiping of cluster time has been a bit of a mystery to many driver engineers, as the specification contains no rationale for it. Discussions with server engineers recently have revealed the following justification:

  • In a sharded cluster, each shard has an independent monotonically increasing logical clock
  • Every write on the shard includes the current logical clock time
  • Gossiping that time to another shard pushes the receiving shard's logical clock forward to just past the gossiped time
  • This means that when a client thread performs a write that targets shard A and then a subsequent write that targets shard B, the second write gets a later time than the first (see the sketch below)
  • This in turn means that the first write will precede the second write in operations which create a total ordering of writes; a change stream is the primary example

Since monitoring connections are never used for writes, there is no benefit to gossiping cluster times from those connections.
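
As a rough illustration of the bullets above, the following toy simulation (illustrative only; real cluster times are timestamp/increment pairs, not a single counter) shows why carrying the gossiped time into the next command gives the second write a later time even when the second shard's own clock lags far behind:

{code:java}
// Toy model of logical-clock gossiping across two shards (not server or driver code).
final class LogicalClock {
    private long time;

    LogicalClock(long start) {
        time = start;
    }

    // Server side: advance to just past the gossiped time, then assign it to the write.
    long writeWithGossip(long gossipedTime) {
        time = Math.max(time, gossipedTime) + 1;
        return time;
    }
}

public final class GossipExample {
    public static void main(String[] args) {
        LogicalClock shardA = new LogicalClock(1000);
        LogicalClock shardB = new LogicalClock(5); // shard B's clock lags far behind

        long clientClusterTime = 0;
        long firstWrite = shardA.writeWithGossip(clientClusterTime);  // 1001
        clientClusterTime = Math.max(clientClusterTime, firstWrite);  // client gossips this forward

        long secondWrite = shardB.writeWithGossip(clientClusterTime); // 1002, not 6
        System.out.println(firstWrite + " < " + secondWrite);         // total order preserved
    }
}
{code}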



Comments
Comment by Preston Vasquez [ 23/Jan/24 ]

This change seems reasonable, noting that this concern is also not guarded against server-side:

Another possibility is that the IP addresses of the mongos servers themselves are getting all mixed up, and a single client ends up connecting to mongos server from different deployments. If that's what's happening, the solution proposed here is not sufficient. And really, the current behavior is probably the safest, as it prevents the client from reading/writing data from more than one deployment using a single MongoClient.

From HELP-53593:

So it looks like mongos replica set monitoring does not participate in cluster time gossiping.

Ultimately though, reassigning IP addresses from members of one cluster to another without also updating the state of those clusters that was based on the old addresses (whether it be connection strings in the driver, shard state, or replica set config) is likely going to cause issues.

Comment by Shane Harvey [ 02/Jan/24 ]

Ah that's a good point. I had not considered the differing system clocks.

Comment by Jeffrey Yemin [ 02/Jan/24 ]

Yeah, that statement is probably not accurate. It's probably fairer to say that an indeterminate number of operations will fail. Consider, for example, what happens if the server running the replica set member from a different cluster has a system clock that is ahead of the other servers. Since the driver only replaces the cluster time with a newer cluster time (by comparing its timestamp portion), the cluster time from the monitoring connection to the "bad" server will remain the MongoClient's cluster time regardless of any cluster times from the other servers. But it could also be the opposite: if the cluster time from the "bad" server is older than those from the correct servers, it will never replace the good servers' cluster times.
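
A tiny illustration of that dynamic with made-up numbers (hypothetical, not driver code): under the "advance only if newer" rule, a far-ahead cluster time received once from the wrong deployment sticks no matter what the correct servers later report, whereas a far-behind one is simply ignored.

{code:java}
// Made-up numbers illustrating why the "bad" server's cluster time can stick.
public final class StaleClusterTimeExample {
    private static long clientClusterTime = 0;

    // Drivers keep a received $clusterTime only if it is newer than the current one.
    private static void advance(long received) {
        if (received > clientClusterTime) {
            clientClusterTime = received;
        }
    }

    public static void main(String[] args) {
        advance(9_999_999L); // from the wrong-deployment member, whose system clock is ahead
        advance(2_000_000L); // from the correct deployment
        advance(2_000_001L); // from the correct deployment
        // The far-ahead value remains the MongoClient's cluster time and is sent with
        // every subsequent command; had it been older, it would never have been kept.
        System.out.println(clientClusterTime); // 9999999
    }
}
{code}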

Comment by Shane Harvey [ 02/Jan/24 ]

This change seems fine to me especially considering that in load balanced mode there's no SDAM so the driver already works in a similar way. I do have one question about this part:

Since cluster times between deployments are not compatible, it results in all operations failing until the application is restarted.

Wouldn't it only result in 0 or 1 operations failing since the error returned by the server will include the correct $clusterTime? Drivers are supposed to update $clusterTime from all server responses, including command errors.

Comment by Jeffrey Yemin [ 19/Dec/23 ]

This issue is also being reported by a customer on v6.0.4 via 01237591, but in that case it's a sharded cluster. It's possible that mongos has the same issue that the drivers do in this regard.

Another possibility is that the IP addresses of the mongos servers themselves are getting all mixed up, and a single client ends up connecting to mongos server from different deployments.  If that's what's happening, the solution proposed here is not sufficient.  And really, the current behavior is probably the safest, as it prevents the client from reading/writing data from more than one deployment using a single MongoClient.

patrick.freed@mongodb.com points out on the related HELP-53593 that mongos shard monitoring does not participate in cluster time gossiping, which is all the more reason for drivers to do the same.
