[SERVER-42602] Guarantee that unconditional step down will not happen due to slow node restarts in rollback_fuzzer_[un]clean_shutdowns suites. Created: 02/Aug/19  Updated: 29/Oct/23  Resolved: 30/Aug/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.1, 4.3.1

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Suganthi Mani
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-42650 Remove stale comments mentioned in th... Closed
Problem/Incident
Related
is related to SERVER-43237 replSetFreeze and replSetStepDown cmd... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Sprint: Repl 2019-08-26, Repl 2019-09-09
Participants:
Linked BF Score: 19

 Description   

There are two kinds of phases in the rollback fuzzer test suites.

  1. State Transition Phase - RollbackTest transitions to a predefined state before the rollback fuzzer enters the workload execution phase.
  2. Workload Execution Phase - the rollback fuzzer executes a list of random commands (including the restartNode cmd, which can result in a change of primary) against the replica set.

After the RollbackTest transitions to the "transitionToSyncSourceOperationsDuringRollback" state, the rollback fuzzer breaks the assumption mentioned here. After that state, the topology looks like the following.

[CurSecondary/Node to be rolled back]
        |
        |
        |
[CurPrimary]-------- [TieBreakerNode]

Once the curSecondary node has been rolled back successfully (i.e., has caught up to curPrimary), restarting curPrimary can cause curSecondary to become the new primary. As a result, during the workload execution phase, an unconditional step down can happen because of slow planned node restarts (i.e., node restarts that take a long time), which leads to undesired behavior in the rollback_fuzzer_[un]clean_shutdowns suites. To fix the issue, we need the two contracts below.

1) During the workload execution phase, an unconditional step down can happen only due to transient network issues, not because of slow planned node restarts (i.e., node restarts that take a long time).

2) Node restarts performed by the rollback fuzzer can change the original primary only if all three nodes are connected.



 Comments   
Comment by Githook User [ 31/Aug/19 ]

Author:

{'username': 'smani87', 'email': 'suganthi.mani@mongodb.com', 'name': 'Suganthi Mani'}

Message: SERVER-42602 retry replSetFreeze on rollback node in tests if replica set config has not been loaded

(cherry picked from commit 6f308bbc6f495da46029b6e6316189a14e7842a3)
Branch: v4.2
https://github.com/mongodb/mongo/commit/fef676084abb58b23103b03c2ef7cc05c3cfb023

Comment by Githook User [ 31/Aug/19 ]

Author:

{'name': 'Benety Goh', 'username': 'benety', 'email': 'benety@mongodb.com'}

Message: SERVER-42602 retry replSetFreeze on rollback node in tests if replica set config has not been loaded
Branch: master
https://github.com/mongodb/mongo/commit/6f308bbc6f495da46029b6e6316189a14e7842a3
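A hedged sketch of what such a retry could look like in a jstest (the use of assert.soonNoExcept and the rollbackNode handle are illustrative assumptions, not the actual patch):

// Retry replSetFreeze until the restarted rollback node has loaded its replica
// set config and the command succeeds.
assert.soonNoExcept(function() {
    assert.commandWorked(rollbackNode.adminCommand({replSetFreeze: 24 * 60 * 60}));
    return true;
}, "timed out waiting for replSetFreeze to succeed on the rollback node");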

Comment by Githook User [ 30/Aug/19 ]

Author:

{'email': 'suganthi.mani@mongodb.com', 'name': 'Suganthi Mani', 'username': 'smani87'}

Message: SERVER-42602 Guarantees that the unconditional step down does not happen due to slow node restarts in rollback_fuzzer_[un]clean_shutdowns suites.

(cherry picked from commit 2cba3bf3640aed0121a05f6396cedadd10a06880)
Branch: v4.2
https://github.com/mongodb/mongo/commit/fd2ce602d86b858d748ca7c40f2a96a77a1175aa

Comment by Githook User [ 30/Aug/19 ]

Author:

{'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}

Message: SERVER-42602 Guarantees that the unconditional step down does not happen due to slow node restarts in rollback_fuzzer_[un]clean_shutdowns suites.
Branch: master
https://github.com/mongodb/mongo/commit/2cba3bf3640aed0121a05f6396cedadd10a06880

Comment by Suganthi Mani [ 08/Aug/19 ]

I totally forgot about the replSetFreeze cmd; that's really a good solution. transitionToSyncSourceOperationsDuringRollback() can execute {replSetFreeze: 24*60*60 /* 24 hrs */}, and we can unfreeze the node during transitionToSteadyStateOperations() by running {replSetFreeze: 0}.
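
A minimal sketch of that approach, assuming rollbackNode is the RollbackTest fixture's connection to the node being rolled back (the exact wiring in the fixture may differ):

// In transitionToSyncSourceOperationsDuringRollback(): keep the rollback node
// from running for election for 24 hours.
assert.commandWorked(rollbackNode.adminCommand({replSetFreeze: 24 * 60 * 60}));

// In transitionToSteadyStateOperations(): lift the freeze so the node is
// electable again.
assert.commandWorked(rollbackNode.adminCommand({replSetFreeze: 0}));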

Comment by Judah Schvimer [ 07/Aug/19 ]

I also prefer solution 1. Instead of adding a failpoint, though, can we use the replSetFreeze command with an "infinite" or 0 timeout to accomplish the same goal? This makes use of things users could actually do and that we already have, which I like.

Comment by Suganthi Mani [ 06/Aug/19 ]

Thanks william.schultz for the detailed explanation. After discussing with him, we decided to drop solution #2, since the reconfig correctness reasoning is not straightforward and it adds more complexity to the rollback test fixture.

Comment by Suganthi Mani [ 06/Aug/19 ]

Currently, we have two candidate solutions to enforce the contract.

1) Introduce a new failpoint like "FailCandidateToBecomeElectable" that prevents a node from running for election. Turn this failpoint on during the "transitionToSyncSourceOperationsDuringRollback" state; as a result, during the workload execution phase, any curPrimary restart after the rollback event (i.e., after curSecondary has successfully rolled back) will not change the original primary. Then turn the failpoint off during the "transitionToSteadyStateOperations" state.

  • This means we must also re-enable the failpoint whenever curSecondary is restarted; otherwise, curSecondary can run for election.

2) The other solution is to set curSecondary's priority to 0 using the reconfig cmd, which would prevent curSecondary from running for election. During "transitionToSyncSourceOperationsDuringRollback", execute the replSetReconfig cmd on the primary, changing curSecondary's priority to 0. replSetReconfig on curPrimary makes sure that the new config is persisted locally, so any subsequent "replSetRequestVotes" request from curSecondary with the previous stale config will fail. Then reset curSecondary's priority to 1 during the "transitionToSteadyStateOperations" state.
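
For reference, a hypothetical sketch of what solution 2 might look like in the test fixture (curPrimary and rollbackNodeIndex are illustrative names, not an actual implementation; this proposal was ultimately dropped):

// During transitionToSyncSourceOperationsDuringRollback(): demote the rollback
// node so it cannot run for election.
let cfg = curPrimary.adminCommand({replSetGetConfig: 1}).config;
cfg.version++;
cfg.members[rollbackNodeIndex].priority = 0;
assert.commandWorked(curPrimary.adminCommand({replSetReconfig: cfg}));

// During transitionToSteadyStateOperations(): restore priority 1 with another
// reconfig in the same way.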

Comment by William Schultz (Inactive) [ 06/Aug/19 ]

To follow up on Suganthi's description, below is an attempt to summarize some of our general discussion and reasoning about each state of the RollbackTest fixture. Note that server replication is stopped on the tiebreaker node throughout the whole process, until the final transition to steady state operations. We looked at each state and tried to convince ourselves whether shutdowns may or may not cause unconditional stepdowns in RollbackTest. We assume that we always wait for a new, stable primary after shutting down a node. suganthi.mani can verify the reasoning below if anything seems incorrect. If nothing else, hopefully this helps clarify each phase of RollbackTest for future reference.

kSteadyStateOps:

   T
 /   \
P1 -  S

Shutdowns may occur on either node and switch the primary an arbitrary number of times. At the end of this phase we just need all nodes to be connected and to have some stable primary. We don't need to worry about who the primary is, though.

kRollbackOps:

   T
 /    
P1    S

If we shut down either the current primary or secondary, the partition should cause each node to return to the same state, since the rollback node (P1) will eventually get re-elected. After this phase we need the oplog of the current primary (the node that will roll back) to have diverged from the sync source.

kSyncSourceOpsBeforeRollback:

   T
     \
P1    P2

If we shut down either node, the sync source node (P2) will end up being re-elected as long as we wait for a new primary after each shutdown. The divergent rollback node should never get elected since it is isolated. It will remain either as a stale primary or a secondary with a divergent oplog. After this phase completes we assume the sync source node is primary and has applied some operations in its new term.

kSyncSourceOpsDuringRollback:

(before rollback completion)

   T
     \
R  -  P2

If we shut down the sync source node during this state it will be able to start up and get re-elected eventually since the rollback node has a stale/divergent oplog. If we shut down the rollback node before or during its rollback it will still have a divergent oplog and eventually end up back in rollback again.

(after rollback completion)

   T
     \
S  -  P2

Now that both nodes have the same oplog, it is possible that either can get elected, so shutdown of either one could lead to arbitrary switching of the primary. Additionally, arbitrarily slow restarts could cause a current primary (e.g. S, if it got elected) to step down unconditionally due to a liveness timeout, since it does not necessarily have the support of the tiebreaker node given the network topology. This is the problem outlined in the ticket description above.
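
For context, the liveness window involved here is the replica set's election timeout (10 seconds by default), which can be inspected from the config; a small illustrative snippet, assuming primary is a connection to the current primary:

// The primary steps down unconditionally if it cannot see a majority for
// roughly this long, so a planned restart of its only reachable peer that
// exceeds this window triggers the stepdown described above.
const cfg = primary.adminCommand({replSetGetConfig: 1}).config;
print("electionTimeoutMillis: " + cfg.settings.electionTimeoutMillis);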

kSteadyStateOps:

   T
 /   \
S  -  P2

The original state.
