Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.3.0, 4.4.11, 5.0.4, 5.1.0-rc1
Affects Version/s: None
Component/s: None
Labels:
None

Backwards Compatibility:
Major Change
Operating System:
ALL
Backport Requested:

v5.1, v5.0, v4.4
Sprint:
Repl 2021-09-06, Repl 2021-09-20
Linked BF Score:
152
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Currently, each replica set node reports its configVersion as the setVersion to the RSM for topology management purposes. The TopologyManager tracks the max setVersion it has seen so far; for any primary that reports a configVersion < maxSetVersion, the RSM sets the node's status to UNKNOWN because it considers the node a stale primary.
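The stale-primary check described above can be sketched roughly as follows. This is an illustrative model only, not the server's code: `TopologyTracker` and `classify_primary` are hypothetical names, and the real TopologyManager tracks considerably more state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopologyTracker:
    """Hypothetical stand-in for the max-setVersion bookkeeping described above."""

    max_set_version: Optional[int] = None

    def classify_primary(self, reported_set_version: int) -> str:
        """Classify a node that claims to be primary, using setVersion only."""
        if (
            self.max_set_version is not None
            and reported_set_version < self.max_set_version
        ):
            # The node reports an older setVersion than one already seen,
            # so it is treated as a stale primary.
            return "UNKNOWN"
        # Otherwise accept the primary and remember the highest setVersion.
        self.max_set_version = reported_set_version
        return "PRIMARY"


tracker = TopologyTracker()
first = tracker.classify_primary(5)   # first primary seen, accepted
second = tracker.classify_primary(4)  # lower setVersion, treated as stale
```

Note that the check looks only at the setVersion; the term plays no part in it, which is what the race below relies on.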

There is an existing race condition: if a user performs a reconfig that bumps the configVersion from V to V + 1 and a new primary steps up before it has applied configVersion V + 1, the replica set enters a state where the RSM is unable to detect a primary, because it thinks the new primary, which still reports config version V, is stale.

Consider the following:

1. Perform a reconfig that bumps (configVersion, term) from (V, T) to (V + 1, T). Secondaries are still on (V, T). The RSM sets maxSetVersion to V + 1.
2. A new primary steps up and sets its own (configVersion, term) to (V, T + 1).
3. The old primary recognizes (V, T + 1) as a newer config than (V + 1, T), since term is given priority when ordering ConfigVersionAndTerm. So the old primary fetches the newer config (V, T + 1) to replace its own config.
4. The RSM does not recognize the new primary as "up-to-date", since it is still reporting setVersion V while maxSetVersion is V + 1.
5. The RSM sets the new primary's status to UNKNOWN and reports a topology of ReplicaSetNoPrimary.
6. The RSM and the replica set stay out of sync, with no way to recover without manual intervention.
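The sequence above can be replayed with a small model. This is a sketch under simplifying assumptions: `is_newer` and `rsm_accepts_primary` are hypothetical helpers, the ordering is reduced to "compare term first, then version", and the concrete values of V and T are arbitrary.

```python
from typing import NamedTuple


class ConfigVersionAndTerm(NamedTuple):
    version: int
    term: int


def is_newer(a: ConfigVersionAndTerm, b: ConfigVersionAndTerm) -> bool:
    """True if config a orders after config b; term is compared first."""
    return (a.term, a.version) > (b.term, b.version)


def rsm_accepts_primary(set_version: int, max_set_version: int) -> bool:
    """The RSM-side check, which looks only at setVersion."""
    return set_version >= max_set_version


V, T = 7, 3  # arbitrary illustrative values

# Reconfig on the old primary: (V, T) -> (V + 1, T); the RSM records V + 1.
old_primary = ConfigVersionAndTerm(version=V + 1, term=T)
max_set_version = old_primary.version

# A node that never applied V + 1 steps up in term T + 1.
new_primary = ConfigVersionAndTerm(version=V, term=T + 1)

# Replication side: (V, T + 1) wins over (V + 1, T) because term has priority,
# so the old primary adopts the new primary's config.
old_primary_adopts_new_config = is_newer(new_primary, old_primary)

# RSM side: setVersion V < maxSetVersion V + 1, so the new primary is rejected.
accepted = rsm_accepts_primary(new_primary.version, max_set_version)
topology = "ReplicaSetWithPrimary" if accepted else "ReplicaSetNoPrimary"
```

The two sides disagree because they order configs by different keys: the replica set orders by term before version, while the RSM check in this model considers the version alone.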

Because we report that the reconfig failed when we fail to replicate the config to the rest of the nodes, users may be prompted to reissue their reconfig on the new primary. However, this can cause failures in our jstests (our stepdown suites in particular). It also seems problematic that the RSM becomes out of sync with the actual state of the replica set (and is unable to recover), as other components rely on the RSM.

is depended on by

DRIVERS-1954 SDAM should give priority to electionId over setVersion when updating topology

Closed

is related to

SERVER-59484 Catch FailedToSatisfyReadPreference in DDL fsm workloads

Closed

Assignee: Andrew Shuvalov (Inactive)
Reporter: Jason Chan
Participants: Andrew Shuvalov, Githook User, Jason Chan, Judah Schvimer, Lamont Nelson, Shane Harvey, Wenbin Zhu
Votes: 0
Watchers: 12

Created: Aug 17 2021 06:28:55 PM UTC
Updated: Oct 29 2023 09:49:27 PM UTC
Resolved: Oct 19 2021 04:04:54 PM UTC
Confidence Status Last Update: 14/Sep/21 6:17 PM
