- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Networking & Observability
- Fully Compatible
- ALL
- Networking & Obs 2024-11-11
The reconfig_race_with_failover.js test attempts to create a situation where we have the following:
- the existing primary receives a new configuration via a manual reconfig (so its config version becomes V+1)
- simultaneously, a secondary is told to step up before it receives the V+1 config, resulting in a (V, T+1) (version, electionId) pair
The expected behavior is that the secondary successfully steps up with a config of (V, T+1), because the original primary's (V+1, T) config has not yet been committed by the full set. The test attempts to achieve this by blocking replication of the config via the skipBeforeFetchingConfig failpoint. This failpoint does not fully block config replication, though, so in some test runs the other secondary receives the (V+1, T) config before the stepping-up secondary's election. As a result, the stepping-up secondary loses the election and is unable to step up. A subsequent query performed in the test then fails because there is no primary.
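The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not server code: it assumes a three-node set, assumes every node is still in the same term when votes are cast (so only the config version matters), and assumes a voter refuses a candidate whose config is older than its own. The names `Node` and `wins_election` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    config_version: int  # config version this node currently has installed


def wins_election(candidate: Node, voters: list[Node]) -> bool:
    """Return True if the candidate gathers a strict majority of votes."""
    total_nodes = len(voters) + 1  # the voters plus the candidate itself
    votes = 1  # the candidate always votes for itself
    for voter in voters:
        # Assumption: a voter rejects a candidate whose config is older
        # than the voter's own config.
        if voter.config_version <= candidate.config_version:
            votes += 1
    return votes > total_nodes // 2


V = 5  # hypothetical starting config version
candidate = Node("stepping_up_secondary", V)
old_primary = Node("old_primary", V + 1)

# Intended case: the other secondary has not yet received the V+1 config.
blocked = wins_election(candidate, [old_primary, Node("other_secondary", V)])

# Flaky case: the V+1 config reached the other secondary before the election.
leaked = wins_election(candidate, [old_primary, Node("other_secondary", V + 1)])

print(blocked, leaked)  # True False
```

In this model the outcome depends entirely on whether the other secondary still holds config version V when the election happens, which is the timing the failpoint was supposed to guarantee.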
This test is meant to cover the RSM's ability to track this state, not to validate any replication behavior. Given that, it is likely more precise, simpler, and less error-prone to rewrite this test as a unit test of the RSM. That way, we won't need a complex series of failpoints to reach the state being tested.
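To show the shape such a unit test could take, here is a minimal sketch against a toy stand-in for the monitor. Everything in it is hypothetical: `ToyMonitor` and `on_primary_response` are not the real RSM class or API, the election ID is simplified to an integer, and the ordering rule (election ID first, then config version) is an assumption made for illustration rather than a statement of how the RSM compares responses.

```python
from typing import Optional, Tuple


class ToyMonitor:
    """Hypothetical stand-in for the RSM; not the real class or its API."""

    def __init__(self) -> None:
        self.primary: Optional[str] = None
        # Highest (election_id, set_version) pair seen from any primary so far.
        self.max_seen: Optional[Tuple[int, int]] = None

    def on_primary_response(self, host: str, set_version: int, election_id: int) -> bool:
        """Record a primary's response; return True if it is accepted."""
        seen = (election_id, set_version)
        # Assumption: the election ID takes precedence over the config version.
        if self.max_seen is not None and seen < self.max_seen:
            return False  # stale primary, ignore it
        self.max_seen = seen
        self.primary = host
        return True


def test_new_primary_with_older_config_is_accepted() -> None:
    V, T = 5, 1  # hypothetical starting version and term
    monitor = ToyMonitor()

    # The old primary reports the reconfigured (V+1, T) state.
    assert monitor.on_primary_response("old_primary", V + 1, T)

    # The stepped-up secondary reports (V, T+1) and should be accepted.
    assert monitor.on_primary_response("new_primary", V, T + 1)
    assert monitor.primary == "new_primary"

    # A late response from the old primary must not displace the new one.
    assert not monitor.on_primary_response("old_primary", V + 1, T)
    assert monitor.primary == "new_primary"


test_new_primary_with_older_config_is_accepted()
```

The point of the sketch is that the (V+1, T) and (V, T+1) responses are fed to the monitor directly, so the state under test is reached deterministically without any failpoints or real elections.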