[GODRIVER-2127] Go Driver should have a retry after invalid ClusterTime Created: 11/Aug/21  Updated: 27/Oct/23  Resolved: 27/Sep/21

Status: Closed
Project: Go Driver
Component/s: None
Affects Version/s: 1.6.0
Fix Version/s: None

Type: New Feature Priority: Unknown
Reporter: Vinicius Grippa Assignee: Matt Dale
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to DRIVERS-1904 Handle invalid $clusterTime documents... Backlog

 Description   

Similar to https://jira.mongodb.org/browse/PYTHON-1434

Go Driver should have a retry option when receiving a bad ClusterTime. Otherwise, similar issue is observed:

2021-08-11T01:27:14.161Z I COMMAND [conn27172109] command admin.$cmd appName: "<redacted>" command: isMaster { ismaster: 1, $clusterTime: { clusterTime: Timestamp(1628645233, 6), signature:

{ keyId: 0, hash: BinData(0, 0000000000000000000000000000000000000000) }

}, $db: "admin", $readPreference:

{ mode: "primary" }

} numYields:0 ok:0 errMsg:"Cache Reader No keys found for HMAC that is valid for time:

{ ts: Timestamp(1628645233, 6) }

with id: 0" errName:KeyNotFound errCode:211 reslen:315 protocol:op_msg 2ms



 Comments   
Comment by PM Bot [ 27/Sep/21 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Matt Dale [ 11/Sep/21 ]

vgrippa@gmail.com thanks for the info about your MongoDB deployment. I've started a discussion about updating the drivers specification to handle the KeyNotFound error in DRIVERS-1904.

As far as the problems that are causing the driver to encounter KeyNotFound errors in the first place, I have two theories:

  • MongoDB nodes with authentication disabled will return a "dummy signed" $clusterTime document. If your cluster includes nodes with authentication both enabled and disabled (e.g. if you have --transitionToAuth enabled on any nodes), you may run into a case where the driver attempts to "gossip" a $clusterTime document with a dummy signature and encounters a KeyNotFound error when it sends an operation to a node with authentication enabled. Do you use --transitionToAuth on any nodes in your cluster, or do you have any nodes that have authentication disabled?
  • Something like SERVER-47568 or SERVER-52955 could cause the server to return a KeyNotFound error even if the $clusterTime doc specifies a valid keyId. The 4.0 fix version that resolves both of those bugs is 4.0.24. Consider updating your cluster to use MongoDB 4.0.24 or newer to avoid that error condition.

As far as reproducing the error manually, I modified Operation.updateClusterTimes() in the Go driver with something like the following code:

// ...
clusterTime := bsoncore.BuildDocumentFromElements(...)
 
sess, clock := op.Client, op.Clock
 
// START MODIFIED CODE //
// Only fake the $clusterTime if the server response included a $clusterTime because
// we need to know the timestamp. Only fake the $clusterTime the first time it's available
// in the server response (use a sync.Once).
if len(clusterTime) > 0 {
	once.Do(func() {
		// Generate a $clusterTime document with a greater timestamp than the actual
		// $clusterTime and set the hash to 20 zero bytes and keyID to 0.
		// E.g. faked doc as bson.D:
		// bson.D{{"$clusterTime", bson.D{
		// 	{"clusterTime", primitive.Timestamp{T: t + 10, I: i}},
		// 	{"signature", bson.D{
		// 		{"hash", primitive.Binary{Subtype: 0x0, Data: []byte{/* 20 0x0 bytes */}}},
		// 		{"keyId", int64(0)},
		// 	}},
		// }}}
		fakedDoc := makeClusterTimeWithZeroKeyID(clusterTime)
		if sess != nil {
			_ = sess.AdvanceClusterTime(bson.Raw(fakedDoc))
		}
 
		if clock != nil {
			clock.AdvanceClusterTime(bson.Raw(fakedDoc))
		}
		return
	})
}
// END MODIFIED CODE //
 
// ...

Comment by Vinicius Grippa [ 29/Aug/21 ]

Hi Matt,

Thanks for checking. Let me answer your questions:

What MongoDB server version are you using?
A: MongoDB 4.0.16
What MongoDB topology are you using (e.g. single server, replica set, sharded cluster, etc)?
A: Sharded cluster with multiple routers (mongoS).
Is your MongoDB deployment managed by Atlas or self-hosted?
A: It is self-hosted. If you can share your tests and how to modify the clusterTime I can test in my staging environment. As I mentioned, I didn't find a way to reproduce the issue manually so far.

Let me know if this helps.

Comment by Matt Dale [ 26/Aug/21 ]

Hey vgrippa@gmail.com, thanks for the extra information, that's really helpful! I was able to reproduce the error you described by manually overwriting the $clusterTime doc returned from a server with one with an invalid keyId and an increased timestamp. I'm investigating possible ways for the driver to recover from getting an invalid $clusterTime doc from the server.

I've got a few questions to help me understand your use case better:

  • What MongoDB server version are you using?
  • What MongoDB topology are you using (e.g. single server, replica set, sharded cluster, etc)?
  • Is your MongoDB deployment managed by Atlas or self-hosted?
Comment by Vinicius Grippa [ 17/Aug/21 ]

Hi Matt,

Thanks for checking. Maybe you might find this useful:

Downgrading to 1.3.0 solves the issue. I believe it is because Go Driver 1.3.0 does not check the cluster time:

type HandshakeOptions struct

{ AppName string Authenticator Authenticator Compressors []string DBUser string PerformAuthentication func(description.Server) bool }

...

desc, err := operation.NewIsMaster().
AppName(ah.options.AppName).
Compressors(ah.options.Compressors).
SASLSupportedMechs(ah.options.DBUser).
GetDescription(ctx, addr, conn)

1.3.0

While Go Driver 1.6.0 checks:

op := operation.NewIsMaster().
AppName(ah.options.AppName).
Compressors(ah.options.Compressors).
SASLSupportedMechs(ah.options.DBUser).
ClusterClock(ah.options.ClusterClock).
ServerAPI(ah.options.ServerAPI).
LoadBalanced(ah.options.LoadBalanced)

1.6.0

Hope it helps.

Comment by Matt Dale [ 17/Aug/21 ]

Hey vgrippa@gmail.com, thanks for reporting this issue! We will look into it soon.

Generated at Thu Feb 08 08:37:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.