[SERVER-42602] Guarantee that unconditional step down will not happen due to slow node restarts in rollback_fuzzer_[un]clean_shutdowns suites. Created: 02/Aug/19 Updated: 29/Oct/23 Resolved: 30/Aug/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.1, 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | Suganthi Mani |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.2 |
| Sprint: | Repl 2019-08-26, Repl 2019-09-09 |
| Participants: | |
| Linked BF Score: | 19 |
| Description |
There are 2 kinds of phases in the rollback fuzzer test suites: the RollbackTest state transitions and the workload execution phases that run between them.

After the RollbackTest transitions to the "transitionToSyncSourceOperationsDuringRollback" state, the rollback fuzzer breaks the assumption mentioned here. After the "transitionToSyncSourceOperationsDuringRollback" state, the topology looks like below.

[CurSecondary/Node to be rolled back]

Once the curSecondary node has been rolled back successfully (i.e., it has caught up to curPrimary), restarting curPrimary can result in curSecondary becoming the new primary. As a result, during the workload execution phase, an unconditional step down can happen due to slow planned node restarts (i.e., node restarts that take a long time), and that leads to undesired behavior in the rollback_fuzzer_[un]clean_shutdown suites.

To fix the issue, we should enforce the following two contracts:
1) During the workload execution phase, an unconditional step down can happen only due to transient network issues, not because of slow planned node restarts (i.e., node restarts that take a long time).
2) Node restarts performed by the rollback fuzzer can change the original primary only if all 3 nodes are connected.
| Comments |
| Comment by Githook User [ 31/Aug/19 ] |
Author: {'username': 'smani87', 'email': 'suganthi.mani@mongodb.com', 'name': 'Suganthi Mani'}
Message: (cherry picked from commit 6f308bbc6f495da46029b6e6316189a14e7842a3)
| Comment by Githook User [ 31/Aug/19 ] |
Author: {'name': 'Benety Goh', 'username': 'benety', 'email': 'benety@mongodb.com'}
Message:
| Comment by Githook User [ 30/Aug/19 ] |
Author: {'email': 'suganthi.mani@mongodb.com', 'name': 'Suganthi Mani', 'username': 'smani87'}
Message: (cherry picked from commit 2cba3bf3640aed0121a05f6396cedadd10a06880)
| Comment by Githook User [ 30/Aug/19 ] |
Author: {'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}
Message:
| Comment by Suganthi Mani [ 08/Aug/19 ] |
I totally forgot about the replSetFreeze cmd; that's a really good solution. transitionToSyncSourceOperationsDuringRollback() can execute {replSetFreeze: 24*60*60 /* 24 hrs */}, and we can unfreeze it during transitionToSteadyStateOperations() by running {replSetFreeze: 0}.
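A minimal sketch of that approach in jstest-style shell JavaScript (the curSecondary connection handle and the exact hook points are illustrative assumptions, not the committed patch):

```javascript
// Freeze the rolled-back secondary so it cannot run for election while
// curPrimary is restarted, then unfreeze it once the test returns to
// steady-state operations.
const kFreezeSecs = 24 * 60 * 60;  // 24 hours, effectively "for the rest of the test"

// Inside transitionToSyncSourceOperationsDuringRollback():
assert.commandWorked(curSecondary.adminCommand({replSetFreeze: kFreezeSecs}));

// Inside transitionToSteadyStateOperations():
assert.commandWorked(curSecondary.adminCommand({replSetFreeze: 0}));
```

Running {replSetFreeze: 0} against a frozen secondary lifts the freeze immediately, so the node becomes electable again for the steady-state phase.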
| Comment by Judah Schvimer [ 07/Aug/19 ] |
I also prefer solution 1. Instead of adding a failpoint though, can we use the replSetFreeze command with an "infinite" or 0 timeout to accomplish the same goal? This makes use of things users could actually do and that we already have, which I like.
| Comment by Suganthi Mani [ 06/Aug/19 ] |
Thanks william.schultz for the detailed explanation. After discussing with him, we decided to drop solution #2, since the reconfig correctness reasoning is not straightforward and it adds more complexity to the rollback test fixture.
| Comment by Suganthi Mani [ 06/Aug/19 ] |
Currently, we have 2 solutions to enforce the contract (see the sketch after this list for option 2).

1) Introduce a new failpoint like "FailCandidateToBecomeElectable" which prevents a node from running for election. Turn this failpoint on during the "transitionToSyncSourceOperationsDuringRollback" state; as a result, during the workload execution phase, any curPrimary restart after the rollback event (i.e., curSecondary has successfully rolled back) will not change the original primary. Then turn the failpoint off during the "transitionToSteadyStateOperations" state.

2) The other solution is to set the priority value of curSecondary to 0 using the replSetReconfig cmd, which would prevent curSecondary from running for election. During "transitionToSyncSourceOperationsDuringRollback", execute the replSetReconfig cmd on the primary, changing curSecondary's priority value to 0. replSetReconfig on curPrimary makes sure that the new config is persisted locally, so any subsequent "replSetRequestVotes" request from curSecondary carrying the previous stale config will fail. Then reset the priority value of curSecondary to 1 during the "transitionToSteadyStateOperations" state.
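A minimal sketch of option 2 in jstest-style shell JavaScript, assuming hypothetical curPrimary and kCurSecondaryIndex handles for the nodes (the names are illustrative, not from the actual patch):

```javascript
// Hypothetical sketch of solution 2: demote curSecondary via priority: 0 so it
// cannot stand for election, then restore priority: 1 once steady state resumes.
function setSecondaryPriority(curPrimary, memberIndex, priority) {
    // Fetch the current config from the primary, change the member's priority,
    // and bump the config version before reconfiguring.
    const config =
        assert.commandWorked(curPrimary.adminCommand({replSetGetConfig: 1})).config;
    config.members[memberIndex].priority = priority;
    config.version += 1;
    assert.commandWorked(curPrimary.adminCommand({replSetReconfig: config}));
}

// During transitionToSyncSourceOperationsDuringRollback():
setSecondaryPriority(curPrimary, kCurSecondaryIndex, 0);

// During transitionToSteadyStateOperations():
setSecondaryPriority(curPrimary, kCurSecondaryIndex, 1);
```

As the 06/Aug comment above notes, this option was ultimately dropped because the reconfig correctness reasoning is not straightforward and it adds complexity to the test fixture.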
| Comment by William Schultz (Inactive) [ 06/Aug/19 ] |
To follow up on Suganthi's description, below is an attempt to summarize some of our general discussion and reasoning about each state of the RollbackTest fixture. Note that server replication is stopped on the tiebreaker node throughout the whole process, until the final transition to steady state operations. We looked at each state and tried to convince ourselves whether shutdowns may or may not cause unconditional stepdowns in RollbackTest. We assume that we always wait for a new, stable primary after shutting down a node. suganthi.mani can verify the reasoning below if anything seems incorrect. If nothing else, hopefully this helps clarify each phase of RollbackTest, for future reference.

kSteadyStateOps:
Shutdowns may occur on either node and switch the primary an arbitrary number of times.

kRollbackOps:
If we shut down either the current primary or secondary, the partition should cause each node to return to the same state, since the rollback node (P1) will eventually get re-elected. After this phase we need the oplog of the current primary (the node that will roll back) to have diverged from the sync source.

kSyncSourceOpsBeforeRollback:
If we shut down either node, the sync source node (P2) will end up being re-elected as long as we wait for a new primary after each shutdown. The divergent rollback node should never get elected since it is isolated; it will remain either a stale primary or a secondary with a divergent oplog. After this phase completes, we assume the sync source node is primary and has applied some operations in its new term.

kSyncSourceOpsDuringRollback (before rollback completion):
If we shut down the sync source node during this state, it will be able to start up and get re-elected eventually, since the rollback node has a stale/divergent oplog. If we shut down the rollback node before or during its rollback, it will still have a divergent oplog and eventually end up back in rollback again.

kSyncSourceOpsDuringRollback (after rollback completion):
Now that both nodes have the same oplog, it is possible that either can get elected, so shutdown of either one could lead to arbitrary switching of the primary. Additionally, arbitrarily slow restarts could cause a current primary (e.g. S, if it got elected) to step down unconditionally due to a liveness timeout, since it does not necessarily have the support of the tiebreaker node given the network topology. This is the problem outlined in the ticket description above.

kSteadyStateOps:
The original state.
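For reference, a sketch of how these phases map onto the RollbackTest fixture's transition methods in a typical jstest (the test name, database, and workload writes are placeholder assumptions):

```javascript
// A typical driver of the RollbackTest fixture; each transition call below
// corresponds to one of the states discussed above.
load("jstests/replsets/libs/rollback_test.js");

const rollbackTest = new RollbackTest("rollback_phases_example");

// kRollbackOps: writes applied here exist only on the node that will roll back.
const rollbackNode = rollbackTest.transitionToRollbackOperations();
assert.commandWorked(rollbackNode.getDB("test").runCommand(
    {insert: "coll", documents: [{toBeRolledBack: 1}]}));

// kSyncSourceOpsBeforeRollback: the sync source becomes primary and diverges
// in its new term while the rollback node stays isolated.
const syncSource = rollbackTest.transitionToSyncSourceOperationsBeforeRollback();
assert.commandWorked(syncSource.getDB("test").runCommand(
    {insert: "coll", documents: [{survives: 1}]}));

// kSyncSourceOpsDuringRollback: the rollback node reconnects to the sync
// source and rolls back its divergent oplog entries.
rollbackTest.transitionToSyncSourceOperationsDuringRollback();

// kSteadyStateOps: all nodes are reconnected; consistency is checked on stop.
rollbackTest.transitionToSteadyStateOperations();
rollbackTest.stop();
```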