[SERVER-31916] Initial request to a shardsvr mongod can return a clustertime signed with the null key Created: 10/Nov/17  Updated: 30/Oct/23  Resolved: 15/Dec/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.0-rc4
Fix Version/s: 3.7.1

Type: Bug Priority: Major - P3
Reporter: Mira Carey Assignee: Misha Tyulenev
Resolution: Fixed Votes: 0
Labels: todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-33585 Do not return $clusterTime when no ke... Closed
related to SERVER-43516 Complete TODO listed in SERVER-31916 Closed
is related to PYTHON-1434 pymongo resends client metadata after... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2018-01-01, Sharding 2017-12-18
Participants:
Case:

 Description   

When interacting with a mongod in a sharded cluster, a client that connects directly to the mongod (instead of via mongos) can receive a null-signed cluster time on its first request. Ordinarily, this happens only when the client has the special privilege that authorizes it to advance the clock, but it can also happen the first time an unprivileged client communicates with the mongod (if that is before keys have been synced).

When that client later attempts to gossip the time, it can receive an error of the form:

Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp 1510338396000|21 } with id: 0

This will only occur when the cluster itself has auth enabled (as otherwise no validation is necessary).

For current tests, that involves blacklisting:

  • jstests/sharding/aggregation_currentop.js
  • jstests/sharding/auth_slaveok_routing.js

and forcing jstests/libs/override_methods/validate_collections_on_shutdown.js to abort if it sees KeyNotFound.

We should come up with a strategy to handle this and remove the blacklist.



 Comments   
Comment by Githook User [ 15/Dec/17 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-31916 wait for clusterTime on mongo connection
Branch: master
https://github.com/mongodb/mongo/commit/36874a480ea4e2be33af298777a8d6824b7b974e

Comment by Misha Tyulenev [ 13/Dec/17 ]

behackett, the specific check would introduce a dependency on the $clusterTime format, and it may affect our ability to change this field in future releases. So I suggest not assuming a specific $clusterTime structure if there is a forward-compatibility requirement.

Comment by Bernie Hackett [ 07/Dec/17 ]

Could we make "$clusterTime.signature.keyId === 0" a valid check that drivers can do, or provide some other way for a driver to know that it shouldn't gossip a particular $clusterTime value?

Comment by Misha Tyulenev [ 07/Dec/17 ]

It's a valid state for a mongod to be available but return a dummy signature. While dummy signatures are easy to recognize, since $clusterTime.signature.keyId === 0, I don't advise drivers to make any assumptions about the $clusterTime format.
The server guarantees that every error response carries a $clusterTime signed with the correct keys (if the server has them), so the driver is supposed to recover by retrying the command on the same connection.

There is a way to make this more reliable by adding a refresh to the time-signing code on mongod, but this may cause a slight performance degradation, so let me know how important this is.

Scenarios where mongod is unable to respond due to validation errors are still possible, but less likely.
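
The recovery path described in this comment can be sketched as a small simulation. Everything here is an assumption made for illustration: makeFakeShard and runWithRetry are hypothetical helpers, keyId is modeled as a plain number, and no real server or driver code is involved. The retry logic never inspects the signature; only the fake server does.

```javascript
// Hypothetical simulation; no real MongoDB server or driver is involved.
// keyId is a plain number here, which is a simplification.

// Fake shard that rejects gossiped times signed with the dummy key (keyId 0)
// and always answers with a properly signed $clusterTime, errors included.
function makeFakeShard(validKeyId) {
    let now = 100;
    return {
        runCommand(cmd) {
            now += 1;
            const signed = {clusterTime: now, signature: {keyId: validKeyId}};
            const gossiped = cmd.$clusterTime;
            if (gossiped && gossiped.signature.keyId === 0) {
                return {ok: 0, codeName: "KeyNotFound", $clusterTime: signed};
            }
            return {ok: 1, $clusterTime: signed};
        },
    };
}

// Run a command while gossiping the latest known time. On failure, adopt the
// $clusterTime from the error reply and retry on the same connection.
function runWithRetry(conn, clock, cmd, maxAttempts = 2) {
    let res;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const request = Object.assign({}, cmd);
        if (clock.latest) {
            request.$clusterTime = clock.latest;
        }
        res = conn.runCommand(request);
        const seen = res.$clusterTime;
        if (seen && (!clock.latest || seen.clusterTime > clock.latest.clusterTime)) {
            clock.latest = seen;
        }
        if (res.ok === 1) {
            break;
        }
    }
    return res;
}

// The client starts with a dummy-signed time from its first request.
const retryClock = {latest: {clusterTime: 100, signature: {keyId: 0}}};
const retryResult = runWithRetry(makeFakeShard(42), retryClock, {ping: 1});
console.log(retryResult.ok, retryClock.latest.signature.keyId);  // prints: 1 42
```

In this model the first attempt fails with KeyNotFound, the client picks up the signed time from the error, and the second attempt succeeds.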

Comment by Bernie Hackett [ 07/Dec/17 ]

jeff.yemin has a theory that this will cause problems for drivers in general.

Couldn't you get into a bad situation even if we fix the Python bug? For instance:

  • the monitor thread calls isMaster
  • it receives a response with a bad cluster time and stores it in the global clock
  • before the next monitor thread call, the application initiates a command, which will include the bad cluster time

I'm assuming that the isMaster response with the bad cluster time was "successful" and causes the SDAM state machine to make the server available.

Granted, this assumes you are connecting directly to a shard on purpose.
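
The scenario in this comment can be modeled in a few lines. This is a hypothetical sketch, not real driver code: ClusterClock is an invented class, keyId is a plain number, and the "greater timestamp wins" rule is an assumption about how a driver might maintain its shared clock.

```javascript
// Hypothetical model of a driver's shared cluster clock; not real driver code.
class ClusterClock {
    constructor() {
        this.latest = null;
    }
    // Keep whichever $clusterTime has the greater timestamp. The signature is
    // treated as opaque, so a dummy-signed value is stored like any other.
    advance(clusterTime) {
        if (!clusterTime) {
            return;
        }
        if (!this.latest || clusterTime.clusterTime > this.latest.clusterTime) {
            this.latest = clusterTime;
        }
    }
}

const globalClock = new ClusterClock();

// 1. The monitor thread's isMaster succeeds but carries a dummy-signed time.
const isMasterReply = {ok: 1, $clusterTime: {clusterTime: 100, signature: {keyId: 0}}};
globalClock.advance(isMasterReply.$clusterTime);

// 2. Before the next monitor check, the application builds a command and
//    attaches whatever the global clock currently holds.
const appCommand = {find: "coll", $clusterTime: globalClock.latest};

console.log(appCommand.$clusterTime.signature.keyId);  // prints: 0
```

Because the clock never looks at the signature, the dummy-signed time is what the application command ends up gossiping.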

Comment by Misha Tyulenev [ 07/Dec/17 ]

This is a self-fixing issue because the error returns the correct signature. This change patches the shell to wait for the valid signature in the ping response.
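
As a rough illustration of what "wait for the valid signature in the ping response" could look like, here is a minimal sketch. It is not the code from the actual commit: waitForSignedClusterTime, hasRealSignature, and makeFakeConn are hypothetical names, keyId is modeled as a plain number, and the connection is a fake so that the example runs without a server.

```javascript
// Hypothetical sketch; not the code from the actual commit.
// Assumes a dummy signature is recognizable by signature.keyId === 0.

function hasRealSignature(res) {
    const ct = res.$clusterTime;
    return Boolean(ct && ct.signature && ct.signature.keyId !== 0);
}

// Repeat ping until the reply carries a real signature or attempts run out.
// Returns the number of pings that were needed.
function waitForSignedClusterTime(conn, maxAttempts = 10) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const res = conn.runCommand({ping: 1});
        if (hasRealSignature(res)) {
            return attempt;
        }
    }
    throw new Error("no signed $clusterTime after " + maxAttempts + " pings");
}

// Fake connection whose first replies are dummy-signed, as if the signing
// keys had not been synced yet.
function makeFakeConn(dummyReplies, validKeyId) {
    let calls = 0;
    return {
        runCommand() {
            calls += 1;
            const keyId = calls <= dummyReplies ? 0 : validKeyId;
            return {ok: 1, $clusterTime: {clusterTime: 100 + calls, signature: {keyId}}};
        },
    };
}

const attemptsNeeded = waitForSignedClusterTime(makeFakeConn(2, 42));
console.log(attemptsNeeded);  // prints: 3
```

A real helper would presumably pause between attempts rather than ping in a tight loop. The keyId check is the kind of format dependency that the comments above advise drivers against.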

Comment by Andy Schwerin [ 07/Dec/17 ]

I believe the plan is to work around the behavior in the shell. Avoiding the race condition that causes it in the server is difficult, and the behavior should only affect newly started servers and new clusters. misha.tyulenev, can you confirm?

Comment by Misha Tyulenev [ 13/Nov/17 ]

It does not seem to be a blocker because:
1) it cannot happen on connections to mongos (mongos always waits for keys)
2) in replica sets it's possible if the timing is unfortunate (only at the beginning of the RS start, when keys are not generated yet), but it's self-correcting, as the error is supposed to return the valid $clusterTime, so the follow-up retries will work
Once fixed, it must be backported to 3.6.

Comment by Ian Whalen (Inactive) [ 13/Nov/17 ]

acm, since this is 3.6 Required, I'm assigning to Platforms to make sure it doesn't get lost.

Generated at Thu Feb 08 04:28:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.