[SERVER-59409] Race between reconfig replication and stepup can cause RSM to be stuck in reporting ReplicaSetNoPrimary Created: 17/Aug/21  Updated: 29/Oct/23  Resolved: 19/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0, 4.4.11, 5.0.4, 5.1.0-rc1

Type: Bug Priority: Major - P3
Reporter: Jason Chan Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by DRIVERS-1954 SDAM should give priority to election... Closed
Related
is related to SERVER-59484 Catch FailedToSatisfyReadPreference i... Closed
Backwards Compatibility: Major Change
Operating System: ALL
Backport Requested:
v5.1, v5.0, v4.4
Sprint: Repl 2021-09-06, Repl 2021-09-20
Participants:
Linked BF Score: 152

 Description   

Currently, replica set nodes will report it's configVersion as the setVersion to the RSM for topology management purposes. The TopologyManager tracks the max setVersion it has seen so far, and for any primaries that report a configVersion < maxSetVersion, the RSM will set the status of the node as UNKNOWN because it thinks it's a stale primary.

There is an existing race condition where if a user performs a reconfig that bumps the configVersion from V to V + 1 and a new primary is stepped up before it has applied configVersion V + 1, which will cause the replica set to enter a state where the RSM is unable to detect a primary because it thinks the new primary that still reports config version V is stale.

Consider the following:

  1. Perform reconfig that bumps (configVersion, term) from (V, T) to (V + 1, T). Secondaries are still on (V, T). RSM sets maxSetVersion to V + 1.
  2. New primary steps up, sets it's own configVersion to (V, T + 1). 
  3. Old primary will recognize (V, T + 1) as a newer config than (V + 1, T) since term is given priority when ordering ConfigVersionAndTerm. So the old primary will fetch the newer config (V, T +1) to replace its own config.
  4. RSM will not recognize the new primary as "up-to-date" since it is still reporting setVersion V when maxSetVersion is set to V + 1.
  5. RSM will set the new primary status to UNKNOWN, and report a topology of ReplicaSetNoPrimary. 
    The RSM and replica set stay out of sync with no way to recover without manual intervention.

Since we will report that the reconfig failed when we fail to replicate the config to the rest of the nodes. This could prompt users to reissue their reconfig on the new primary. However, this can cause failures in our jstests (our stepdown suites in particular). And also, it sounds like it could be problematic that the RSM becomes out of sync with the actual state with the replica set (and is unable to recover) as there are other components that rely on the RSM. 



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 02/May/22 ]

Even though this is a significant protocol change it doesn't look to me that this affects the downgrade procedure, the reason:

This fixes a bug when a race happens between executing a reconfig command and primary of that shard failover. When the race happens the sender of the command gets stuck because the combination of new values is treated as stale and RSM will wait for a refresh forever.

Downgrade means migrating to the world where this race is possible. So follow the normal downgrade procedure and expect that the race, which is very rare, is now possible.

Comment by Githook User [ 14/Nov/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-61466 SERVER-59409 BACKPORT-10659 Fix race between reconfig replication and stepup
Branch: v4.4
https://github.com/mongodb/mongo/commit/18f93bfb7a2931c474aa8ee9dae990a0d5ec8442

Comment by Githook User [ 08/Nov/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 Reverted: Fix race between reconfig replication and stepup
Branch: v4.4
https://github.com/mongodb/mongo/commit/b701620a0666b27f71468981fa840557d810a2c2

Comment by Githook User [ 07/Nov/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 Fix race between reconfig replication and stepup
Branch: v4.4
https://github.com/mongodb/mongo/commit/e6bcbca509e5bd789de0184fc422a69845b8dfe7

Comment by Githook User [ 05/Nov/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 BACKPORT-10659 part 3, make the ElectionIdSetVersionPair comparisons tidy
Branch: v4.4
https://github.com/mongodb/mongo/commit/8e4cec9a8b8dcf41eb5bcfe2824e60c4633667b4

Comment by Githook User [ 05/Nov/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 BACKPORT-10659 part 2 no op change in tests to require term and Set version
Branch: v4.4
https://github.com/mongodb/mongo/commit/3ad825a4e3ae253c87d3adb25c95627724d03f90

Comment by Githook User [ 05/Nov/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 BACKPORT-10659 part 1, add class ElectionIdSetVersionPair
Branch: v4.4
https://github.com/mongodb/mongo/commit/a7b1936040a8ff2c66d73c8317e6b947a193c70a

Comment by Githook User [ 19/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 BACKPORT-10658 Fix race between reconfig replication and stepup parts 3 4
Branch: v5.0
https://github.com/mongodb/mongo/commit/d02754325ef96fed4bc926ba22b03479bb5c6b83

Comment by Githook User [ 18/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 BACKPORT-10658 Fix race between reconfig replication and stepup part 2
Branch: v5.0
https://github.com/mongodb/mongo/commit/39530228b5bbffb13e6c3213d8b2583476b5098e

Comment by Githook User [ 18/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 BACKPORT-10658 Fix race between reconfig replication and stepup part 1
Branch: v5.0
https://github.com/mongodb/mongo/commit/aee30405dfc8be4e69f37eb57b1178f73ec9fc1d

Comment by Githook User [ 15/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 backport BACKPORT-10657: Fix race between reconfig replication and stepup part 4
Branch: v5.1
https://github.com/mongodb/mongo/commit/b996dfbc5135e94b38fb86014c656930cdc2f533

Comment by Githook User [ 15/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 backport part 3, make the ElectionIdSetVersionPair comparisons tidy
Branch: v5.1
https://github.com/mongodb/mongo/commit/b69563e8b35f91441bc6b666cba1c09414b27a08

Comment by Andrew Shuvalov (Inactive) [ 15/Oct/21 ]

The Waterfall looks unchanged to me after 9 hours in head and unless there are any objections I'm doing the 5.1 backport now as there are hot BFs attached to that branch, e.g. BF-21948.

Comment by Andrew Shuvalov (Inactive) [ 15/Oct/21 ]

Filed DRIVERS-1954 for SDAM spec change.

Comment by Githook User [ 14/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 Fix race between reconfig replication and stepup
Branch: master
https://github.com/mongodb/mongo/commit/565f818c1277998b9712fe2c9fc8c6d6158df07b

Comment by Githook User [ 05/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 part 3, make the ElectionIdSetVersionPair comparisons tidy
Branch: master
https://github.com/mongodb/mongo/commit/f6ef384450a3e562a8cb0607b1f50c85c9c47e3c

Comment by Githook User [ 01/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 part 1, add class ElectionIdSetVersionPair to be used in TopologyStateMachine later
Branch: master
https://github.com/mongodb/mongo/commit/3a5fdbfcf053b1b165a474705639e0ef0fe28b88

Comment by Githook User [ 01/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-59409 part 2 no op change in tests to require term and Set version in Hello
Branch: master
https://github.com/mongodb/mongo/commit/a4d9afb865514b69dfd03ae6867fc416f2b7f708

Comment by Shane Harvey [ 14/Sep/21 ]

I'm a bit surprised that SDAM was written with the wrong order but I'll defer to replication to decide on the proper ordering.

I feel that it should be sufficient to only use the term since we are trying to detect a stale primary

This section may have the answer:

Replica set members running MongoDB 2.6.10+ or 3.0+ include an integer called "setVersion" and an ObjectId called "electionId" in their hello or legacy hello response. Starting with MongoDB 3.2.0, replica sets can use two different replication protocol versions; electionIds from one protocol version must not be compared to electionIds from a different protocol.

Because protocol version changes require replica set reconfiguration, clients use the tuple (setVersion, electionId) to detect stale primaries.

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#using-setversion-and-electionid-to-detect-stale-primaries

Comment by Lamont Nelson [ 14/Sep/21 ]

This issue isn't technically a bug. The RSM is just following the specification. It seems that the tuple ordering implied by the updateRSFromPrimary routine in the spec is actually the opposite of the replication code. The problem is that (setVersion, electionId) is not monotonically increasing, but (electionId, setVersion) is. The RSM is using the first ordering, and this ticket is an example of why it doesn't work. It makes sense to me that this is the case. The portion of the code (and spec) that is in question is trying to determine if a message is from a stale primary. This is the main purpose of the term in the raft protocol – similar to the proposal number in paxos. I feel that it should be sufficient to only use the term since we are trying to detect a stale primary, but I might not have considered all the use cases. However, in order to make any changes via repl set reconfig we have to talk to a primary so it makes sense that (electionId, setVersion) would be the ordering. All reconfig changes have to happen under the leadership of some primary since we can only run them from the primary. I'm not sure if (setVersion, electionId) monotonicity was ever true for this reason. Either way, this ticket provides a counter example for the latest version of the code.

Comment by Shane Harvey [ 14/Sep/21 ]

A spec change starts by creating a DRIVERS ticket followed by a github PR to the specs repo. However, I'm not following what the spec change would be because I thought this issue was a bug in the server itself. The drivers expect (setVersion, electionId) to be monotonically increasing when reported from a primary. Does that expectation no longer hold? What do you propose for the new driver behavior?

Comment by Wenbin Zhu [ 09/Sep/21 ]

lamont.nelson Thanks for confirming, I will assign this to Sharding NYC backlog for now.

One additional thought: Since this will probably change the RSM Spec, will there be a problem for backward compatibility with v3.2.1 and before because electionId was not guaranteed to be monotonic for those versions? Let me know if you need any help from Repl to make that work, thanks.

Comment by Lamont Nelson [ 09/Sep/21 ]

Since it seems like the changes that need to occur within the server are RSM related we can take this ticket if you like.

Comment by Lamont Nelson [ 09/Sep/21 ]

This seems like a problem to me. It seems intuitively (to me at least) correct to have the ordering be (term/electionId, version) since term should uniquely identify a primary and the topology changes must occur by talking to the primary. At least within the server we can make the RSM consistent with the repl sub-system.

Comment by Wenbin Zhu [ 09/Sep/21 ]

RSM uses the (setVersion, electionId) to detect stale primary where setVersion is compared first, but looks like the electionId is basically the server's term. However server uses (term, version) pair where term is compared first, and then the version (same as setVersion). This means that RSM uses an opposite approach than the server to detect stale primary, which seems off. I think this is because there used to be no guarantee for electionId to be monotonic before v3.2.1.

Another interesting finding is that it the Spec has already described the same issue here. However the proposed solution (ignore setVersion from non-primaries) does not seems to be correct because server A and server B can both report as primary, while reporting setVersion 5 and 4 respectively, so ignoring setVersion from non-primaries does not take effect here to prevent the issue.  shane.harvey lamont.nelson Could you help confirm if this is a problem with the Spec?

Comment by Jason Chan [ 18/Aug/21 ]

Thanks shane.harvey.

What is the rationale for this order? Drivers give priority to the setVersion not the term.

We order term first as part of our config consensus protocol so that in scenarios similar to above (when a new node steps up before the reconfig on the old primary is majority committed), the new primary is still considered to have the most "up-to-date config" (with a higher term). If the old primary fails over before replicating configVersionAndTerm (V + 1, T), a new primary can stepUp with config (V, T + 1). When the old primary rejoins the set as a secondary, it knows that (V, T + 1) is a newer config and should fetch it to align with the rest of the cluster.

Comment by Lamont Nelson [ 18/Aug/21 ]

Would it be possible to not advertise the new setVersion in the hello response until a majority of the nodes have completed the reconfig?

Also it seems strange that a primary can stepup without having applied the latest reconfig.

Comment by Shane Harvey [ 17/Aug/21 ]

Yes, I believe this issue would also effect drivers. Drivers expect (setVersion, electionId) to be monotonically increasing.

Old primary will recognize (V, T + 1) as a newer config than (V + 1, T) since term is given priority when ordering ConfigVersionAndTerm.

What is the rationale for this order? Drivers give priority to the setVersion not the term.

Comment by Jason Chan [ 17/Aug/21 ]

Is this an issue for drivers in addition to the RSM?

I think this is possible since IIUC, RSM follows the same SDAM protocol as drivers. shane.harvey, does this sound like an issue for drivers?

Comment by Judah Schvimer [ 17/Aug/21 ]

Is this an issue for drivers in addition to the RSM?

Comment by Jason Chan [ 17/Aug/21 ]

Also note: Replication has a history of doing internal reconfigs (like in setFCV), which can cause problems in our stepdown suites

Comment by Jason Chan [ 17/Aug/21 ]

cc: lamont.nelson to see if the behavior above is a real issue.

Some potential solutions may include:

  • Having the RSM take into account the term value when calculating maxSetVersion
  • Have replica set nodes also increment the config version when stepping up
  • Only update maxSetVersion once a majority of replica set nodes report the new version
Generated at Thu Feb 08 05:47:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.