[SERVER-32639] Arbiters in standalone replica sets can't sign or validate clusterTime with auth on once FCV checks are removed Created: 10/Jan/18 Updated: 30/Oct/23 Resolved: 13/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.7, 3.7.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Misha Tyulenev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | todo_in_code | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||
| Backport Requested: | v3.6 |
||||||||||||||||||||||||||||
| Sprint: | Sharding 2018-03-26, Sharding 2018-04-09, Sharding 2018-04-23 | ||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Description |
|
Because arbiters don't persist any data, they never replicate the admin.system.keys collection, which replica set members read from locally to load the keys used for clusterTime signing and validation. Currently, this issue is masked in our tests because clusterTimes aren't signed or validated when FCV is not fully upgraded to v3.6, and since arbiters also don't persist the admin.system.version collection, they never update their in-memory FCV and stay at v3.4. After we remove the FCV checks in
This shouldn't be a problem in sharded clusters, since keys are only persisted on the CSRS and are cached in memory on every other node in the cluster.
Implementation: The simplest and most efficient way is not to install the logical clock if it's an arbiter node: https://github.com/mongodb/mongo/blob/r3.7.3/src/mongo/db/db.cpp#L780
Sharding part: jack.mulrow please ack.
1. Add
and
Alternative: do not create LogicalClock optionally anymore: I prefer to not change more than needed, mostly because the performance is hurt by taking the lock every time LogicalClock data is accessed.
2. Other LogicalClock public methods should invariant(_isEnabled); so there are no accidental calls to a disabled logical clock.
3. Do not validate or advance logicalTime if it's not enabled. https://github.com/mongodb/mongo/blob/r3.7.3/src/mongo/rpc/metadata.cpp#L102
Replication part: judah.schvimer please ack.
5. https://github.com/mongodb/mongo/blob/r3.7.3/src/mongo/db/db.cpp#L539 initializes the replication coordinator so it can tell if the current node is an arbiter.
Note: Monitor keys: https://github.com/mongodb/mongo/blob/r3.7.3/src/mongo/db/db.cpp#L527 happens before this line. I don't think it can be moved later, as keys may be needed for proper initialization of the oplog |
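For illustration, items 2 and 3 of the plan could look roughly like the sketch below. It is not the server code: `SketchLogicalClock`, `disable()`, `isEnabled()`, and `processClusterTimeMetadata()` are hypothetical names, `assert` stands in for the server's `invariant`, and cluster time is reduced to a plain integer. It also assumes the clock is disabled only once at startup, before requests are served, and it keeps the enabled flag in an atomic so that checking it does not take the clock's lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for LogicalClock; the names and layout are
// illustrative only and do not mirror the real class.
class SketchLogicalClock {
public:
    // Assumed to be called once at startup when the node is an arbiter.
    void disable() { _isEnabled.store(false); }

    bool isEnabled() const { return _isEnabled.load(); }

    // Item 2: public accessors check that the clock is enabled, so an
    // accidental call on a disabled clock fails loudly.
    std::uint64_t getClusterTime() const {
        assert(isEnabled());  // stands in for invariant(_isEnabled)
        std::lock_guard<std::mutex> lk(_mutex);
        return _clusterTime;
    }

    void advanceClusterTime(std::uint64_t newTime) {
        assert(isEnabled());  // stands in for invariant(_isEnabled)
        std::lock_guard<std::mutex> lk(_mutex);
        if (newTime > _clusterTime) {
            _clusterTime = newTime;  // cluster time only moves forward
        }
    }

private:
    std::atomic<bool> _isEnabled{true};
    mutable std::mutex _mutex;
    std::uint64_t _clusterTime = 0;
};

// Item 3: metadata processing skips validation and advancement entirely
// when the clock is disabled. Returns true if the received time was
// processed and false if it was skipped.
bool processClusterTimeMetadata(SketchLogicalClock& clock, std::uint64_t received) {
    if (!clock.isEnabled()) {
        return false;
    }
    clock.advanceClusterTime(received);
    return true;
}
```

With this shape, callers on the request path check `isEnabled()` once and never reach the asserting accessors on an arbiter.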
| Comments |
| Comment by Githook User [ 23/Jul/18 ] |
|
Author: {'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'} Message: |
| Comment by Misha Tyulenev [ 16/Apr/18 ] |
|
While 3.6 is not affected by this issue due to another bug / feature (the arbiter node "thinks" that it's 3.4 and skips cluster time processing), if that is fixed it will expose this bug. Hence I think it's better to backport the fix in advance. |
| Comment by Githook User [ 13/Apr/18 ] |
|
Author: {'email': 'misha@mongodb.com', 'name': 'Misha Tyulenev', 'username': 'mikety'} Message: |
| Comment by Judah Schvimer [ 11/Apr/18 ] |
|
siyuan.zhou, the code you linked prevents a direct reconfig from arbiter to not arbiter. Is it possible to first reconfig a node out of a replica set and then reconfig it in as an arbiter, or reconfig an arbiter out of a replica set and then reconfig it in as a normal node without restart? |
| Comment by Eric Milkie [ 11/Apr/18 ] |
|
Arbiters do have storage (they store the replica set config in the local database) and user writes to local db are allowed on ARBITER state nodes. Not sure if that changes the solution possibilities here. |
| Comment by Siyuan Zhou [ 11/Apr/18 ] |
|
judah.schvimer, I believe you cannot convert a secondary to an arbiter via reconfig without shutting down the server. The documentation confirms that.
I'm not aware of places where the replica set uses the logical clock other than the oplog. |
| Comment by Judah Schvimer [ 10/Apr/18 ] |
|
A reconfig cannot change arbiter status unless the node gets added as a new member. That is certainly a case you'll want to test (removing and adding the node back in as a new member). I think you could do that without ever shutting the node down, but something I'm not thinking of may be preventing that. siyuan.zhou, any thoughts? And arbiters can accept writes to non-replicated collections. I don't think arbiters can participate in any causal relations meaningfully, though that may be something to document. I think disabling the clock after reading the config from disk and on receiving the first config via a heartbeat would be fine. |
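A rough sketch of the idea in the comment above: decide the clock's state whenever a config is installed, whether it was read from disk or received via a heartbeat. Every type and function here (`MemberSketch`, `ReplSetConfigSketch`, `ClockSwitchSketch`, `findSelf`, `onConfigInstalled`) is hypothetical and is not the replication coordinator's API. Leaving the clock unchanged while the node is absent from the config is an assumption made for the sketch, as is re-enabling it when the node is added back as a data-bearing member.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified view of a replica set config.
struct MemberSketch {
    std::string host;
    bool arbiterOnly;
};

struct ReplSetConfigSketch {
    std::vector<MemberSketch> members;
};

// Hypothetical switch standing in for enabling/disabling the logical clock.
struct ClockSwitchSketch {
    bool enabled = true;
};

// Returns this node's member entry, or nullptr if it is not in the config.
const MemberSketch* findSelf(const ReplSetConfigSketch& cfg, const std::string& selfHost) {
    for (const auto& m : cfg.members) {
        if (m.host == selfHost) {
            return &m;
        }
    }
    return nullptr;
}

// Called for every installed config: the one loaded from disk at startup
// and any later one received via heartbeat.
void onConfigInstalled(const ReplSetConfigSketch& cfg,
                       const std::string& selfHost,
                       ClockSwitchSketch& clock) {
    const MemberSketch* self = findSelf(cfg, selfHost);
    if (self == nullptr) {
        return;  // assumption: a removed node keeps its current clock state
    }
    // Arbiters run with the clock disabled; data-bearing members with it enabled.
    clock.enabled = !self->arbiterOnly;
}
```

The remove-then-re-add case mentioned above maps onto three calls: a config with the node as an arbiter, a config without the node, and a config with the node as a normal member.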
| Comment by Misha Tyulenev [ 10/Apr/18 ] |
|
Thanks jack.mulrow, I enumerated the questions so they are easier to refer to. |
| Comment by Jack Mulrow [ 10/Apr/18 ] |
|
I think the high-level approach makes sense; I just have a few comments/questions:
Two questions:
judah.schvimer Do you know if we need to worry about either of these?
a. I'm guessing it's fine for arbiters to not return $clusterTime even though other nodes in the set are? Maybe it's worth checking with drivers that this won't mess something up in their protocols for gossiping $clusterTime.
b. Since arbiters don't store the sharding identity document (because they have no storage), I don't think they will ever have their sharding state enabled. This shouldn't matter if we just disable the clock for arbiters in all cases though.
c. When we disable the logical clock for arbiters, should we also have the keys collection manager stop monitoring? |
| Comment by Misha Tyulenev [ 10/Apr/18 ] |
|
The reason the clock is disabled for an arbiter in a standalone RS is that it can't get the keys for cryptography, because they are read from the local database. In a sharded cluster the keys are stored on the config shard, so they are not required locally, and any dataless node can still process requests. |
| Comment by Judah Schvimer [ 10/Apr/18 ] |
Can you elaborate on why the sharding state being disabled is required? If it's an arbiter, aren't we saying that we always want to disable the logical clock? |