[SERVER-43356] May fail to recover after a rollbackViaRefetch if sync source no longer has required opTime
Created: 16/Sep/19  Updated: 06/Dec/22  Resolved: 03/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Vesselina Ratcheva (Inactive)
Assignee: Backlog - Replication Team
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-43867 Work around unrecoverability of rollb... Closed
related to SERVER-38925 Rollback via refetch can cause _id du... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:
Linked BF Score: 30

 Description   

To reach a consistent state after a rollbackViaRefetch, the rollback node sets its own minValid to the top of its sync source's oplog at the start of recovery. To complete the recovery, it must therefore be able to replicate and apply oplog entries up to that opTime (the sync source's lastApplied at that moment). However, if the necessary oplog entries disappear from the set at some point mid-recovery (due to either a rollback or an unclean shutdown), the rollback node ends up in an unrecoverable state, because it may only choose a sync source that has them.
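
The failure mode described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the server's actual logic: opTimes are modeled as (term, timestamp) tuples, and the names Node, start_recovery, and choose_sync_source are hypothetical helpers invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Toy stand-in for an opTime: (term, timestamp). Assumption for illustration only.
OpTime = Tuple[int, int]


@dataclass
class Node:
    """Hypothetical replica set member holding a simplified oplog."""
    name: str
    oplog: List[OpTime] = field(default_factory=list)

    def top_of_oplog(self) -> OpTime:
        return self.oplog[-1]

    def has_optime(self, optime: OpTime) -> bool:
        return optime in self.oplog


def start_recovery(sync_source: Node) -> OpTime:
    """The rollback node records minValid as the top of the sync source's oplog."""
    return sync_source.top_of_oplog()


def choose_sync_source(candidates: List[Node], min_valid: OpTime) -> Optional[Node]:
    """Only a node that still has the minValid entry is an acceptable sync source."""
    for node in candidates:
        if node.has_optime(min_valid):
            return node
    return None


source = Node("source", [(1, 1), (1, 2), (1, 3)])
other = Node("other", [(1, 1)])

# Recovery starts: minValid is pinned to the sync source's current top of oplog.
min_valid = start_recovery(source)
recoverable_before = choose_sync_source([source, other], min_valid) is not None

# Mid-recovery, the sync source itself rolls back and the set moves on in a new
# term, so the entry at minValid no longer exists anywhere in the set.
source.oplog = [(1, 1), (2, 2)]
other.oplog = [(1, 1), (2, 2)]
recoverable_after = choose_sync_source([source, other], min_valid) is not None
```

In this model, recoverable_before is true and recoverable_after is false: once the entry at minValid is gone from every member, no eligible sync source remains and the rollback node cannot make progress.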



 Comments   
Comment by William Schultz (Inactive) [ 30/Sep/19 ]

For what it's worth, SERVER-38925 was a serious bug that was specifically caught by the rollback_fuzzer_unclean_shutdowns suite when EMRC=false. I believe the clean shutdowns suite also exposed it, but unclean shutdowns made it more likely to surface. We might want to consider that before completely removing this suite on EMRC=false.
