[SERVER-59409] Race between reconfig replication and stepup can cause RSM to be stuck in reporting ReplicaSetNoPrimary Created: 17/Aug/21 Updated: 29/Oct/23 Resolved: 19/Oct/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.3.0, 4.4.11, 5.0.4, 5.1.0-rc1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jason Chan | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Major Change |
| Operating System: | ALL |
| Backport Requested: | v5.1, v5.0, v4.4 |
| Sprint: | Repl 2021-09-06, Repl 2021-09-20 |
| Participants: | |
| Linked BF Score: | 152 |
| Description |
|
Currently, replica set nodes report their configVersion as the setVersion to the RSM for topology-management purposes. The TopologyManager tracks the maximum setVersion it has seen so far, and for any primary that reports a configVersion < maxSetVersion, the RSM sets the node's status to UNKNOWN because it assumes the node is a stale primary. There is an existing race condition: if a user performs a reconfig that bumps the configVersion from V to V + 1, and a new primary steps up before it has applied configVersion V + 1, the replica set enters a state where the RSM is unable to detect a primary, because the new primary still reports config version V and is therefore treated as stale. Consider the following:

1. The primary is at configVersion V. A user reconfig bumps the config to V + 1, and the RSM observes setVersion V + 1, so maxSetVersion becomes V + 1.
2. Before the new config is replicated to the rest of the nodes, the primary steps down and another node that is still on configVersion V steps up.
3. The new primary reports setVersion V < maxSetVersion, so the RSM marks it UNKNOWN and is stuck reporting ReplicaSetNoPrimary.
We report that the reconfig failed when we fail to replicate the config to the rest of the nodes, which could prompt users to reissue their reconfig on the new primary. However, this race can cause failures in our jstests (our stepdown suites in particular). It also seems problematic that the RSM becomes out of sync with the actual state of the replica set (and is unable to recover), as there are other components that rely on the RSM. |
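A minimal sketch of the staleness check described above, assuming a simplified topology manager that tracks only maxSetVersion; the names (TopologyManager, on_hello_from_primary) are illustrative, not the actual server or driver code:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.status = "Unknown"

class TopologyManager:
    """Illustrative sketch: tracks only the max setVersion seen so far."""
    def __init__(self):
        self.max_set_version = 0

    def on_hello_from_primary(self, node, set_version):
        if set_version < self.max_set_version:
            node.status = "Unknown"  # treated as a stale primary
        else:
            self.max_set_version = set_version
            node.status = "RSPrimary"

rsm = TopologyManager()
old_primary, new_primary = Node("A"), Node("B")

V = 1
rsm.on_hello_from_primary(old_primary, V + 1)  # user reconfig bumps V -> V + 1
old_primary.status = "RSSecondary"             # old primary then steps down
rsm.on_hello_from_primary(new_primary, V)      # stepped up before applying V + 1
print(new_primary.status)  # "Unknown" -- no node is ever marked RSPrimary
```

In this toy model, no node is marked RSPrimary again until something reports setVersion >= V + 1, which matches the stuck ReplicaSetNoPrimary state described above.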
| Comments |
| Comment by Andrew Shuvalov (Inactive) [ 02/May/22 ] |
|
Even though this is a significant protocol change, it doesn't look to me like it affects the downgrade procedure. The reason: this fixes a bug that occurs when a reconfig command races with a failover of that shard's primary. When the race happens, the sender of the command gets stuck because the combination of new values is treated as stale and the RSM waits for a refresh forever. Downgrading means moving back to a world where this race is possible, so follow the normal downgrade procedure and expect that the race, which is very rare, is again possible. |
| Comment by Githook User [ 14/Nov/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 08/Nov/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 07/Nov/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 05/Nov/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 05/Nov/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 05/Nov/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 19/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 18/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 18/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 15/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 15/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Andrew Shuvalov (Inactive) [ 15/Oct/21 ] |
|
The Waterfall looks unchanged to me after 9 hours on head, and unless there are any objections I'm doing the 5.1 backport now, as there are hot BFs attached to that branch, e.g. BF-21948. |
| Comment by Andrew Shuvalov (Inactive) [ 15/Oct/21 ] |
|
Filed |
| Comment by Githook User [ 14/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 05/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 01/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Githook User [ 01/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb). Message: |
| Comment by Shane Harvey [ 14/Sep/21 ] |
|
I'm a bit surprised that SDAM was written with the wrong order, but I'll defer to replication to decide on the proper ordering.
This section may have the answer:
|
| Comment by Lamont Nelson [ 14/Sep/21 ] |
|
This issue isn't technically a bug: the RSM is just following the specification. It seems that the tuple ordering implied by the updateRSFromPrimary routine in the spec is actually the opposite of the replication code. The problem is that (setVersion, electionId) is not monotonically increasing, but (electionId, setVersion) is. The RSM is using the first ordering, and this ticket is an example of why it doesn't work.

It makes sense to me that this is the case. The portion of the code (and spec) in question is trying to determine whether a message is from a stale primary. That is the main purpose of the term in the Raft protocol, similar to the proposal number in Paxos. I feel it should be sufficient to use only the term, since we are trying to detect a stale primary, but I might not have considered all the use cases.

However, in order to make any changes via repl set reconfig we have to talk to a primary, so it makes sense that (electionId, setVersion) would be the ordering: all reconfig changes have to happen under the leadership of some primary, since we can only run them from the primary. I'm not sure if (setVersion, electionId) monotonicity was ever true for this reason. Either way, this ticket provides a counterexample for the latest version of the code. |
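To make the two orderings concrete, here is the ticket's counterexample expressed with Python tuple comparison (the values are illustrative, with electionId standing in for the term):

```python
# The old primary applied the reconfig (V + 1, T); the new primary stepped up
# on the old config but with a higher term (V, T + 1).
old_primary = {"setVersion": 2, "electionId": 1}
new_primary = {"setVersion": 1, "electionId": 2}

def key_rsm(s):
    # (setVersion, electionId): the ordering the RSM / SDAM spec uses
    return (s["setVersion"], s["electionId"])

def key_repl(s):
    # (electionId, setVersion): the ordering the replication code uses
    return (s["electionId"], s["setVersion"])

# Under the RSM ordering, the legitimate new primary compares as stale:
print(key_rsm(new_primary) < key_rsm(old_primary))    # True
# Under the replication ordering, it correctly compares as newer:
print(key_repl(new_primary) > key_repl(old_primary))  # True
```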
| Comment by Shane Harvey [ 14/Sep/21 ] |
|
A spec change starts by creating a DRIVERS ticket followed by a github PR to the specs repo. However, I'm not following what the spec change would be because I thought this issue was a bug in the server itself. The drivers expect (setVersion, electionId) to be monotonically increasing when reported from a primary. Does that expectation no longer hold? What do you propose for the new driver behavior? |
| Comment by Wenbin Zhu [ 09/Sep/21 ] |
|
lamont.nelson Thanks for confirming; I will assign this to the Sharding NYC backlog for now. One additional thought: since this will probably change the RSM spec, will there be a backward-compatibility problem with v3.2.1 and earlier, because electionId was not guaranteed to be monotonic in those versions? Let me know if you need any help from Repl to make that work, thanks. |
| Comment by Lamont Nelson [ 09/Sep/21 ] |
|
Since it seems like the changes that need to occur within the server are RSM related we can take this ticket if you like. |
| Comment by Lamont Nelson [ 09/Sep/21 ] |
|
This seems like a problem to me. It seems intuitively (to me at least) correct to have the ordering be (term/electionId, version) since term should uniquely identify a primary and the topology changes must occur by talking to the primary. At least within the server we can make the RSM consistent with the repl sub-system. |
| Comment by Wenbin Zhu [ 09/Sep/21 ] |
|
The RSM uses (setVersion, electionId) to detect a stale primary, comparing setVersion first, but it looks like the electionId is basically the server's term. The server, however, uses the (term, version) pair, comparing term first and then the version (the same as setVersion). This means the RSM uses the opposite ordering from the server to detect a stale primary, which seems off. I think this is because electionId was not guaranteed to be monotonic before v3.2.1. Another interesting finding is that the spec has already described the same issue here. However, the proposed solution (ignore setVersion from non-primaries) does not seem correct, because server A and server B can both report as primary while reporting setVersion 5 and 4 respectively, so ignoring setVersion from non-primaries does nothing to prevent the issue. shane.harvey lamont.nelson Could you help confirm whether this is a problem with the spec? |
| Comment by Jason Chan [ 18/Aug/21 ] |
|
Thanks shane.harvey.
We order term first as part of our config consensus protocol so that in scenarios similar to the above (when a new node steps up before the reconfig on the old primary is majority-committed), the new primary is still considered to have the most up-to-date config (with a higher term). If the old primary fails over before replicating configVersionAndTerm (V + 1, T), a new primary can step up with config (V, T + 1). When the old primary rejoins the set as a secondary, it knows that (V, T + 1) is a newer config and should fetch it to align with the rest of the cluster, as in the sketch below. |
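A small sketch of the term-first comparison described here, using a hypothetical is_config_newer helper rather than the server's actual implementation:

```python
def is_config_newer(a, b):
    """True if config a = (version, term) is newer than b, comparing term first."""
    (v_a, t_a), (v_b, t_b) = a, b
    return (t_a, v_a) > (t_b, v_b)

V, T = 1, 1
# New primary's config (V, T + 1) vs. the old primary's un-replicated (V + 1, T):
print(is_config_newer((V, T + 1), (V + 1, T)))  # True: the higher term wins
```

So when the old primary rejoins as a secondary, the term-first comparison tells it that (V, T + 1) is the newer config and that it should fetch it.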
| Comment by Lamont Nelson [ 18/Aug/21 ] |
|
Would it be possible to not advertise the new setVersion in the hello response until a majority of the nodes have completed the reconfig? Also, it seems strange that a primary can step up without having applied the latest reconfig. |
| Comment by Shane Harvey [ 17/Aug/21 ] |
|
Yes, I believe this issue would also affect drivers. Drivers expect (setVersion, electionId) to be monotonically increasing.
What is the rationale for this order? Drivers give priority to the setVersion, not the term. |
| Comment by Jason Chan [ 17/Aug/21 ] |
I think this is possible since, IIUC, the RSM follows the same SDAM protocol as drivers. shane.harvey, does this sound like an issue for drivers? |
| Comment by Judah Schvimer [ 17/Aug/21 ] |
|
Is this an issue for drivers in addition to the RSM? |
| Comment by Jason Chan [ 17/Aug/21 ] |
|
Also note: Replication has a history of doing internal reconfigs (like in setFCV), which can cause problems in our stepdown suites. |
| Comment by Jason Chan [ 17/Aug/21 ] |
|
cc: lamont.nelson to see if the behavior above is a real issue. Some potential solutions may include:
|