- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.2
- Repl 2019-08-26, Repl 2019-09-09
- 19
Rollback fuzzer test suites have 2 kinds of phases.
- State Transition Phase - RollbackTest transitions to a predefined state before the rollback fuzzer enters the workload execution phase.
- Workload Execution Phase - the rollback fuzzer executes a list of random commands (including the restartNode cmd, which can change the primary) on the replica set.
Once RollbackTest transitions to the "transitionToSyncSourceOperationsDuringRollback" state, the assumption mentioned here in the rollback fuzzer no longer holds. After that transition, the topology looks like this:
[CurSecondary/Node to be rolled back]
|
|
|
[CurPrimary]-------- [TieBreakerNode]
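The topology above can be written down as a small connectivity model. The following is a minimal Python sketch, not the RollbackTest implementation: the node names, the `LINKS` set, and the helpers `reachable_voters` and `sees_majority` are hypothetical, and the model only counts which members each node can reach directly. It ignores oplog freshness, priorities, and election terms.

```python
# Hypothetical model of the topology after the
# "transitionToSyncSourceOperationsDuringRollback" state.
NODES = {"curSecondary", "curPrimary", "tieBreaker"}

# Undirected links taken from the diagram: curSecondary <-> curPrimary
# and curPrimary <-> tieBreaker. curSecondary and tieBreaker are not linked.
LINKS = {
    frozenset({"curSecondary", "curPrimary"}),
    frozenset({"curPrimary", "tieBreaker"}),
}


def reachable_voters(node, links=LINKS, nodes=NODES):
    """Return the node itself plus every node it has a direct link to."""
    return {node} | {
        other
        for other in nodes
        if other != node and frozenset({node, other}) in links
    }


def sees_majority(node, links=LINKS, nodes=NODES):
    """True if the node can directly reach a strict majority of the set."""
    return len(reachable_voters(node, links, nodes)) > len(nodes) // 2


for name in sorted(NODES):
    print(name, sorted(reachable_voters(name)), sees_majority(name))
```

In this simplified model curSecondary reaches curPrimary directly, so the two of them form a majority of the 3-node set even though curSecondary cannot reach the tie-breaker node.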
Once the curSecondary node has rolled back successfully (i.e., caught up to curPrimary), restarting curPrimary can cause curSecondary to become the new primary. As a result, during the workload execution phase, an unconditional step down can happen because of slow planned node restarts (i.e., node restarts that take a long time). That leads to undesired behavior in the rollback_fuzzer_[un]clean_shutdown suites. To fix the issue, we should enforce the following 2 contracts.
1) During the workload execution phase, an unconditional step down can happen only because of transient network issues, not because of slow planned node restarts (i.e., node restarts that take a long time).
2) Node restarts issued by the rollback fuzzer can change the original primary only if all 3 nodes are connected.
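Contract 2 can be read as a precondition on node restarts. The following is a minimal Python sketch under stated assumptions: `all_connected` and `restart_may_change_primary` are hypothetical helpers (not part of RollbackTest or the rollback fuzzer), and "all 3 nodes are connected" is interpreted here as every pair of nodes having a direct link.

```python
from itertools import combinations


def all_connected(nodes, links):
    """True if every pair of nodes has a direct link (fully connected)."""
    return all(frozenset(pair) in links for pair in combinations(nodes, 2))


def restart_may_change_primary(nodes, links):
    """Contract 2 (as interpreted here): a restart may change the original
    primary only when all nodes in the set are connected to each other."""
    return all_connected(nodes, links)


nodes = ["curSecondary", "curPrimary", "tieBreaker"]

# Topology during rollback: curSecondary and tieBreaker are not linked.
partitioned = {
    frozenset({"curSecondary", "curPrimary"}),
    frozenset({"curPrimary", "tieBreaker"}),
}
# Steady state: every pair of nodes is linked.
steady = partitioned | {frozenset({"curSecondary", "tieBreaker"})}

print(restart_may_change_primary(nodes, partitioned))  # False
print(restart_may_change_primary(nodes, steady))       # True
```

A fuzzer could consult such a check before issuing a restart and, when it returns False, take whatever steps the test fixture provides to keep the original primary in place. How that is actually enforced is left to the real implementation.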
- is depended on by: SERVER-42650 Remove stale comments mentioned in the RollbackTest for "transitionToSteadyStateOperations" state. (Closed)
- is related to: SERVER-43237 replSetFreeze and replSetStepDown cmd done part of restartNode()/transitionToSteadyStateOperations() in rollback test should be resilient of network error. (Closed)