[SERVER-42650] Remove stale comments mentioned in the RollbackTest for "transitionToSteadyStateOperations" state. Created: 06/Aug/19  Updated: 27/Oct/23  Resolved: 06/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Suganthi Mani Assignee: Judah Schvimer
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-42602 Guarantee that unconditional step dow... Closed
depends on SERVER-45178 Rollback via refetch can cause rollba... Closed
Related
related to SERVER-37390 RollbackTestFixture doesn't need to w... Closed
is related to SERVER-44679 Rollback fuzzer must account for tran... Closed
Participants:

 Description   

The comments mentioned here in "transitionToSteadyStateOperations" state  is no longer valid with the tiebreaker model. To be noted, we can transition to "kSteadyStateOps" only from "kSyncSourceOpsDuringRollback". When the RollbackTest was having arbiter (SERVER-33587), the network topology for state "kSyncSourceOpsDuringRollback" was like

   
   A
 /   \
P1 -  S
 
A- Arbiter, P1 - curPrimary, S- curSecondary which is expected to rollback.

As a result, the comment mentioned here is valid as node S can get vote from arbiter and become primary.

But, with the tiebreaker setup, the network topology for state "kSyncSourceOpsDuringRollback" is like

   
   T
 /   
P1 -  S
 
T- Tiebreaker, P1 - curPrimary, S- curSecondary which is expected to rollback.

Over in this model, there is noway, node S can escape from rollback because we make sure in "kSyncSourceOpsBeforeRollback" state that node S & P1's oplog entries are diverged. And, the node P1 writes an oplog entry in the higher term than node S. So, no way node S can get "yes" vote from node P1 for replSetRequestVote cmd unless the node S has rolled back.
Also, to be noted, when we are executing this RollbackTest transition step, we don't expect any kind of random restarts by fuzzer as random restarts happens only during workload execution phase  and not during state transition phase (as mentioned in SERVER-42602). So, there is a guarantee that, at end of kSyncSourceOpsBeforeRollback node S branch of history is not same as node P1.

This ticket should remove  this stale comment and the if-else check. Also, update the RollbackTest such that the rollbackId check is executed for all cases in transitionToSteadyStateOperations().



 Comments   
Comment by Suganthi Mani [ 06/Jan/20 ]

Marking it as gone away, as the original intention of the ticket (stale comment removal) got fixed by SERVER-37390. And, we have a todo for SERVER-45178. So, SERVER-45178 will take care of removing this else code block (i.e., skipping RBID check for majority read concern is off).

Comment by Suganthi Mani [ 03/Jan/20 ]

FYI, Original stale comment has been removed by SERVER-37390. After, SERVER-37390, w/ & w/o node restarts, we always expect the node to go to rollback state. But, skipping the lastRBID check will be based on whether the node has enabled majority read concern or not. After fixing, SERVER-45178, we can remove if-else code block and check the lastRBID always.

Comment by Andy Schwerin [ 18/Oct/19 ]

Only depends on closed tickets. Unblocking.

Comment by Judah Schvimer [ 08/Aug/19 ]

Hmm, that actually seems like a real bug since it means we can roll back without letting other nodes know. I'm not sure if this could cause problems before since it requires restart, but it almost certainly could with resumable initial sync where the sync source is allowed to restart, and the initial syncing node must find out if the sync source rolls back (by checking the RBID). We need to increment the rollback id before truncating the oplog.

Comment by Suganthi Mani [ 08/Aug/19 ]

I am thinking of below case for enableMajorityReadConcern= false which uses RVR. 

1) syncFixup() // Truncates the oplog.
                   ---> ***Unclean shutdown happens here***
2) incrementRollbackID() // Persists the updated Rollback ID.

When the node restarts, there is a chance we may not go into ROLLBACK state. And, if we plan to check for rollback id in transitionToSteadyStateOperations() for all cases, then it might be a problem.

Comment by Judah Schvimer [ 07/Aug/19 ]

I would rather not change the behavior to enable this test, which is not critical. I agree that we could do it before syncFixUp() though. I don't follow how rollback via refetch (RVR) would lose the rollback id (RBID). The RBID is still persisted for RVR and we're still guaranteeing the node has to go through rollback for the test fixture to make progress. Can you please clarify?

Comment by Suganthi Mani [ 06/Aug/19 ]

quick note, for RTT (enableMajorityReadConcern = true), rollbackID info is persisted in local.system.rolback.id collection before the start of the real rollback process. So, any restarts on rollbacked node refresh the rollback id info from local.system.rolback.id collection at startup.

For rollbackViaRefetch, the incremented rollback id is persisted only after the real rollback step. So, if there is an unclean shutdown happens before persisting the rollback id, the node might lose the latest rollback id info. So, this ticket might be a problem for enableMajorityReadConcern= false.

For rollbackViaRefetch, I am not seeing any good reason for wrapping the incrementRollbackID() call in ON_BLOCK_EXIT. To make the behavior consistent with RTT, we should call incrementRollbackID() before syncFixUp(). judah.schvimer, any thoughts.

Comment by Suganthi Mani [ 06/Aug/19 ]

This ticket should go in after SERVER-42602. Else, the curSecondary may point to the node that is not rolled back (i.e. node P1). As a result, this condition may fail.

Generated at Thu Feb 08 05:01:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.