[DRIVERS-2798] Gossiping the cluster time from monitoring connections can result in loss of availability Created: 18/Dec/23 Updated: 23/Jan/24 |
|
| Status: | Backlog |
| Project: | Drivers |
| Component/s: | Sessions |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Jeffrey Yemin | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | availability |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Driver Changes: | Needed |
| Quarter: | FY25Q2 |
| Description |
Summary

In unusual situations, gossiping the cluster time received on monitoring connections results in complete loss of availability and requires an application restart. The problem was traced to a temporary state during which the driver attempts to connect to a member of the wrong replica set running on the same pod. Since cluster times between deployments are not compatible, all operations fail until the application is restarted.

Motivation

Who is the affected end user?

We only have one report of this.

How does this affect the end user?

Availability is completely compromised and an application restart is required.

How likely is it that this problem or use case will occur?

It's certainly unusual, as we have not heard other reports of this from people using our Kubernetes operator. On the other hand, the fix is likely simple for most drivers, though testing is an issue (there are probably no tests of the existing behavior).

If the problem does occur, what are the consequences and how severe are they?

Complete loss of availability to the desired cluster.

Is this issue urgent?

The user has no simple workaround, though a workaround is possible.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No.

Acceptance Criteria

The requirement is a clarification to the sessions specification stating that cluster time gossiping should be limited to pooled connections and should not include monitoring connections. It's unclear, though, how a test could be written. In a POC of this in the Java driver, it was achieved by a simple design change that made it impossible to gossip the cluster time from monitoring connections, but it's certainly possible that a future design change could reverse that and the issue could be re-introduced.

Additional Notes

Gossiping of cluster time has been a bit of a mystery to many driver engineers, as the specification contains no rationale for it. Recent discussions with server engineers revealed the following justification:

Since monitoring connections are never used for writes, there is no benefit to gossiping cluster times from those connections. |
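For concreteness, a minimal sketch of the clarified behavior might look like the following. It is written in Java because the POC mentioned above was in the Java driver, but the ClusterClock class, its methods, and the ConnectionKind enum are hypothetical illustrations, not the Java driver's actual design or API.

```java
import org.bson.BsonDocument;
import org.bson.BsonTimestamp;

// Hypothetical sketch of the proposed rule: only replies received on pooled
// (operation) connections advance the client's $clusterTime; replies received
// on monitoring (SDAM) connections are ignored for gossiping.
final class ClusterClock {

    // Latest known {$clusterTime: {clusterTime: <timestamp>, signature: ...}}
    private BsonDocument clusterTime;

    enum ConnectionKind { POOLED, MONITORING }

    synchronized void advance(BsonDocument serverReply, ConnectionKind source) {
        if (source == ConnectionKind.MONITORING) {
            return; // proposed change: never gossip cluster time from monitoring connections
        }
        BsonDocument received = serverReply.getDocument("$clusterTime", null);
        if (received == null) {
            return;
        }
        // Only move the clock forward: keep whichever cluster time has the
        // greater timestamp portion.
        if (clusterTime == null
                || timestampOf(received).compareTo(timestampOf(clusterTime)) > 0) {
            clusterTime = received;
        }
    }

    synchronized BsonDocument current() {
        return clusterTime;
    }

    private static BsonTimestamp timestampOf(BsonDocument clusterTimeDoc) {
        return clusterTimeDoc.getTimestamp("clusterTime");
    }
}
```

A stronger variant, closer to what the POC describes, would be to not expose the cluster clock to the monitoring layer at all, so that monitoring replies structurally cannot advance it; the enum check above is simply the shortest way to express the rule.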
| Comments |
| Comment by Preston Vasquez [ 23/Jan/24 ] |
|
This change seems reasonable, noting that this concern is also not guarded against server-side:
From HELP-53593:
|
| Comment by Shane Harvey [ 02/Jan/24 ] |
|
Ah that's a good point. I had not considered the differing system clocks. |
| Comment by Jeffrey Yemin [ 02/Jan/24 ] |
|
Yeah, that statement is probably not accurate. It's fairer to say that an indeterminate number of operations will fail. Consider, for example, the case where the server running the replica set member from a different cluster has a system clock that is ahead of the other servers. Since the driver only replaces the cluster time with a newer cluster time (by comparing the timestamp portion), the cluster time from the monitoring connection to the "bad" server will remain the MongoClient's cluster time regardless of any cluster times received from the other servers. But it could also go the other way: if the cluster time from the "bad" server is older than the ones from the correct servers, it will never replace the cluster times from the good servers. |
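To make the clock-skew scenario concrete, here is a small, self-contained illustration; the timestamp values are invented for the example and are not taken from the actual report.

```java
import org.bson.BsonTimestamp;

// Illustration (hypothetical values) of why a "bad" server whose clock is
// ahead of the real deployment permanently wins the newest-timestamp comparison.
public class ClusterTimeSkewExample {
    public static void main(String[] args) {
        // Timestamp portion of $clusterTime: seconds since the epoch plus an increment.
        BsonTimestamp fromGoodServer = new BsonTimestamp(1_700_000_000, 1);
        BsonTimestamp fromBadServer  = new BsonTimestamp(1_700_003_600, 1); // clock ~1 hour ahead

        // Drivers only ever advance the cluster time, so once the bad server's
        // value is adopted, no value from the good servers can replace it.
        BsonTimestamp kept = fromBadServer.compareTo(fromGoodServer) > 0 ? fromBadServer : fromGoodServer;
        System.out.println("kept cluster time: " + kept); // the bad server's time wins
    }
}
```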
| Comment by Shane Harvey [ 02/Jan/24 ] |
|
This change seems fine to me, especially considering that in load balanced mode there's no SDAM, so the driver already works in a similar way. I do have one question about this part:
Wouldn't it only result in 0 or 1 operations failing since the error returned by the server will include the correct $clusterTime? Drivers are supposed to update $clusterTime from all server responses, including command errors. |
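A hedged sketch of the behavior described here, assuming a hypothetical advanceClusterTime helper rather than any real driver API: the point is simply that $clusterTime is harvested from every reply, including command errors, before the error is surfaced.

```java
import org.bson.BsonDocument;
import org.bson.BsonString;

// Sketch of the sessions-spec expectation that drivers advance $clusterTime
// from every server reply, including command errors, so an error reply that
// carries the correct $clusterTime can repair the client's view on the next
// round trip. advanceClusterTime is a hypothetical stand-in, not a driver API.
final class ReplyHandling {

    static void handleReply(BsonDocument reply) {
        // Gossip first: even a failed command's reply may carry a newer $clusterTime.
        BsonDocument received = reply.getDocument("$clusterTime", null);
        if (received != null) {
            advanceClusterTime(received);
        }

        // Then report the failure, if any.
        if (reply.get("ok").asNumber().doubleValue() != 1.0) {
            throw new RuntimeException(
                    reply.getString("errmsg", new BsonString("command failed")).getValue());
        }
    }

    private static void advanceClusterTime(BsonDocument clusterTime) {
        // Placeholder for the newest-timestamp comparison sketched in the description.
    }
}
```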
| Comment by Jeffrey Yemin [ 19/Dec/23 ] |
|
This issue is also being reported by a customer on v6.0.4 via 01237591, but in that case it's a sharded cluster. It's possible that mongos has the same issue that the drivers do in this regard. Another possibility is that the IP addresses of the mongos servers themselves are getting mixed up, and a single client ends up connecting to mongos servers from different deployments. If that's what's happening, the solution proposed here is not sufficient, and in fact the current behavior is probably the safest, as it prevents the client from reading/writing data from more than one deployment using a single MongoClient. patrick.freed@mongodb.com points out on the related HELP-53593 that mongos shard monitoring does not participate in cluster time gossiping, which is all the more reason for drivers to do the same. |