Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.1, 4.3.1
Affects Version/s: None
Component/s: Replication
Labels: None

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.2
Sprint: Repl 2019-08-26, Repl 2019-09-09
Linked BF Score: 19
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Rollback fuzzer test suites run in two kinds of phases.

State Transition Phase - RollbackTest transitions to a predefined state before the rollback fuzzer enters the workload execution phase.
Workload Execution Phase - the rollback fuzzer executes a list of random commands on the replica set, including the restartNode command, which can change the primary.

Once RollbackTest transitions to the "transitionToSyncSourceOperationsDuringRollback" state, we break the assumption mentioned here in the rollback fuzzer. After that transition, the topology looks like this:

[CurSecondary/Node to be rolled back]
|
|
|
[CurPrimary]-------- [TieBreakerNode]
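
As an illustration only, the topology above can be modeled as a small graph to see which nodes could gather a majority of votes. This is a simplified sketch, not RollbackTest code: the names `NODES`, `LINKS`, `reachable_voters`, and `can_win_election` are hypothetical, the link set is read off the diagram, and the election rule is reduced to "a candidate needs votes from a strict majority of nodes it can reach, counting itself" (it ignores oplog freshness, priorities, and election timeouts).

```python
# Hypothetical node names taken from the diagram above.
NODES = ("curSecondary", "curPrimary", "tieBreaker")

# Assumed connectivity after "transitionToSyncSourceOperationsDuringRollback":
# curSecondary <-> curPrimary and curPrimary <-> tieBreaker only.
LINKS = {
    frozenset(("curSecondary", "curPrimary")),
    frozenset(("curPrimary", "tieBreaker")),
}


def reachable_voters(node, links=LINKS, nodes=NODES):
    """Return the node itself plus every node it shares a direct link with."""
    return {node} | {other for other in nodes if frozenset((node, other)) in links}


def can_win_election(node, links=LINKS, nodes=NODES):
    """Simplified rule: the candidate must reach a strict majority of all nodes."""
    return len(reachable_voters(node, links, nodes)) > len(nodes) // 2


for name in NODES:
    print(name, can_win_election(name))
```

Under these assumptions, curSecondary reaches two of the three nodes (itself and curPrimary), so nothing in the topology alone prevents it from being elected once it has caught up.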

Once the curSecondary node has rolled back successfully (i.e., has caught up to curPrimary), restarting curPrimary can result in curSecondary becoming the new primary. As a result, during the workload execution phase, an unconditional step down can happen because of slow planned node restarts (i.e., node restarts that take a long time). That leads to undesired behavior in the rollback_fuzzer_[un]clean_shutdown suites. To fix the issue, we should enforce the following 2 contracts.

1) During the workload execution phase, an unconditional step down can happen only because of transient network issues, not because of slow planned node restarts (i.e., node restarts that take a long time).

2) Node restarts performed by the rollback fuzzer can change the original primary only if all 3 nodes are connected.
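
Contract 2 can be read as a precondition on connectivity. The sketch below is one hedged way to express that check; `fully_connected` and `restart_may_change_primary` are hypothetical helper names that do not come from RollbackTest or the rollback fuzzer, and the node names and link sets are assumptions made for illustration.

```python
def fully_connected(nodes, links):
    """True when every distinct pair of nodes shares a direct link."""
    return all(
        frozenset((a, b)) in links
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
    )


def restart_may_change_primary(nodes, links):
    """Contract 2 (as modeled here): a restart may change the original
    primary only when all nodes are connected to each other."""
    return fully_connected(nodes, links)


nodes = ("curSecondary", "curPrimary", "tieBreaker")

# Assumed topology while curSecondary and tieBreaker are still disconnected.
partial_links = {
    frozenset(("curSecondary", "curPrimary")),
    frozenset(("curPrimary", "tieBreaker")),
}

# Assumed topology once all 3 nodes are connected.
all_links = partial_links | {frozenset(("curSecondary", "tieBreaker"))}

print(restart_may_change_primary(nodes, partial_links))
print(restart_may_change_primary(nodes, all_links))
```

In this model, a restart issued under the partial topology would be expected to keep the original primary, while a restart under the fully connected topology is allowed to change it.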

is depended on by

SERVER-42650 Remove stale comments mentioned in the RollbackTest for "transitionToSteadyStateOperations" state.

Closed

is related to

SERVER-43237 replSetFreeze and replSetStepDown cmd done part of restartNode()/transitionToSteadyStateOperations() in rollback test should be resilient of network error.

Closed

Assignee: Suganthi Mani
Reporter: Suganthi Mani
Participants: Githook User, Judah Schvimer, Suganthi Mani, Will Schultz
Votes: 1
Watchers: 5

Created: Aug 02 2019 06:39:37 AM UTC
Updated: Oct 29 2023 10:18:26 PM UTC
Resolved: Aug 30 2019 05:18:43 PM UTC
Confidence Status Last Update: 23/Aug/19 10:10 PM
