[DRIVERS-1904] Handle invalid $clusterTime documents when gossiping cluster time Created: 27/Aug/21 Updated: 31/Mar/22 |
| Status: | Backlog |
| Project: | Drivers |
| Component/s: | Sessions |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Matt Dale | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | leads-triage | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Driver Changes: | Needed | ||||||||||||||||
| Description |
Summary
MongoDB 3.6+ replica sets and sharded clusters return a $clusterTime document on all operations. Drivers are required to send the newest seen $clusterTime document on all operations (called gossiping cluster time). E.g. $clusterTime document:
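As a rough illustration of the gossip rule described above, the sketch below models a $clusterTime document as a plain Python dict and keeps whichever document has the greater timestamp. It is a simplified sketch, not driver code: the tuple stand-in for a BSON Timestamp, the placeholder hash bytes, the example keyId value, and the helper name `newest_cluster_time` are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional

ClusterTimeDoc = Dict[str, Any]

# Illustrative $clusterTime document. A (seconds, increment) tuple stands in
# for a BSON Timestamp; the hash bytes and keyId are placeholder values.
EXAMPLE_CLUSTER_TIME: ClusterTimeDoc = {
    "clusterTime": (1630000000, 1),
    "signature": {
        "hash": bytes(range(20)),  # placeholder, not a real signature
        "keyId": 6982622368254918657,  # placeholder key id
    },
}


def newest_cluster_time(
    current: Optional[ClusterTimeDoc],
    received: Optional[ClusterTimeDoc],
) -> Optional[ClusterTimeDoc]:
    """Return the document with the greater clusterTime timestamp.

    Hypothetical helper: only the timestamp is compared. The signature is
    never inspected, so a document with keyId 0 can replace a valid one.
    """
    if received is None:
        return current
    if current is None or received["clusterTime"] > current["clusterTime"]:
        return received
    return current
```

Note that a rule like this compares timestamps only, so the client has no way to notice that the newest document it holds carries an unusable signature.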
A server may respond with a $clusterTime document containing a "dummy signed" cluster time that specifies keyId: 0. If that happens, subsequent operations that gossip the new $clusterTime document with keyId: 0 may get a KeyNotFound server error instead of the expected operation result. New operations may continue to fail until a server response includes a newer $clusterTime document (i.e. one with a greater timestamp) that contains a valid signature and a valid keyId.
The proposed improved behavior when the server responds with a KeyNotFound error (code 211):
Motivation
Who is the affected end user?
Users who encounter the KeyNotFound server error caused by receiving an invalid $clusterTime document from the server while attempting to run operations. See related tickets
How does this affect the end user?
Users who encounter the KeyNotFound server error may encounter up to a 100% operation error rate until a server responds with a newer, valid $clusterTime document. The cluster time is only advanced on a write operation, so the client's $clusterTime document will be updated as soon as a write happens and the client performs another operation.
How likely is it that this problem or use case will occur?
Conditions required:
If those conditions happen, when the driver sends any operation to a server that has a lower $clusterTime timestamp, the server will respond with a KeyNotFound error. Simplified example of what's happening:
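The failure mode described above can be approximated with a small in-memory simulation. This is a hedged sketch rather than the ticket's own example: `FakeServer`, `FakeClient`, the key id 7, and the timestamps are all hypothetical, and the server model is deliberately simplified so that the signature key is checked only when the gossiped timestamp is ahead of the server's own.

```python
from typing import Any, Dict, List, Optional, Set, Union

KEY_NOT_FOUND = 211  # server error code named in this ticket
Doc = Dict[str, Any]


class ServerError(Exception):
    def __init__(self, code: int) -> None:
        super().__init__(f"server error {code}")
        self.code = code


def make_doc(seconds: int, key_id: int) -> Doc:
    # (seconds, increment) stands in for a BSON Timestamp; hash is a placeholder.
    return {"clusterTime": (seconds, 1),
            "signature": {"hash": b"\x00" * 20, "keyId": key_id}}


class FakeServer:
    """Simplified model: validates the key only if the gossiped time is ahead."""

    def __init__(self, cluster_time: Doc, known_key_ids: Set[int]) -> None:
        self.cluster_time = cluster_time
        self.known_key_ids = known_key_ids

    def run(self, gossiped: Optional[Doc]) -> Doc:
        if gossiped and gossiped["clusterTime"] > self.cluster_time["clusterTime"]:
            if gossiped["signature"]["keyId"] not in self.known_key_ids:
                raise ServerError(KEY_NOT_FOUND)
            self.cluster_time = gossiped
        return {"ok": 1, "$clusterTime": self.cluster_time}


class FakeClient:
    """Gossips the newest $clusterTime document it has seen."""

    def __init__(self) -> None:
        self.cluster_time: Optional[Doc] = None

    def run(self, server: FakeServer) -> Union[str, int]:
        try:
            reply = server.run(self.cluster_time)
        except ServerError as exc:
            return exc.code
        seen = reply["$clusterTime"]
        if self.cluster_time is None or seen["clusterTime"] > self.cluster_time["clusterTime"]:
            self.cluster_time = seen
        return "ok"


def simulate() -> List[Union[str, int]]:
    primary = FakeServer(make_doc(100, key_id=0), known_key_ids={7})   # dummy signed
    secondary = FakeServer(make_doc(99, key_id=7), known_key_ids={7})  # behind primary
    client = FakeClient()
    outcomes = [
        client.run(primary),    # client picks up the keyId 0 document
        client.run(secondary),  # gossiped time is ahead and key 0 is unknown
        client.run(secondary),  # still failing; nothing newer has been seen
    ]
    primary.cluster_time = make_doc(101, key_id=7)  # a write advances cluster time
    outcomes.append(client.run(primary))    # client receives a newer, valid document
    outcomes.append(client.run(secondary))  # gossip now validates
    return outcomes
```

In this model `simulate()` returns `["ok", 211, 211, "ok", "ok"]`: reads sent to the lagging secondary fail until the client receives a newer, validly signed document from the primary.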
If the problem does occur, what are the consequences and how severe are they?
Some percentage of operations, up to 100%, will fail until the driver receives a new, valid $clusterTime document from a server response (that is, until another write happens that advances the cluster time). The percentage of operations that fail depends on the percentage of operations sent to servers that have a $clusterTime timestamp lower than the one sent on the operation. For example, if the driver sends all read and write operations to the primary server in a replica set, it is unlikely or impossible for the Client to hold a $clusterTime document newer than the one on the primary, because every $clusterTime document it receives comes from the primary. However, if the driver is configured to send writes to a primary and reads to a secondary, a $clusterTime document received from the primary could have a greater timestamp than the current $clusterTime document on the secondary. Subsequent read operations sent to the secondary would include the newer $clusterTime document and could cause a KeyNotFound error if the keyId on that $clusterTime document is invalid.
Is this issue urgent?
No.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No. |