[SERVER-64068] Relax requirement on catchup takeover dry-run Created: 01/Mar/22  Updated: 06/Dec/22

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Frederic Vitzikam Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-51100 Make dry-run elections write to lastV... Blocked
Assigned Teams:
Replication
Participants:

 Description   

The goal of SERVER-51100 is to make dry-run elections write to lastVote, as real elections do.
The code was changed to do so, but we discovered an issue with catchup takeover dry runs:
they require the primary to vote `yes`, as per SERVER-29502 (see also the Epic and the attached design document, `Rules of Catchup Takeover` in particular). See VoteRequester::Algorithm for the current code.
Currently, disk issues on the primary do not affect its ability to vote in a catchup takeover dry run (because a dry run does not write lastVote), so the takeover can still succeed.
With SERVER-51100, a primary with disk issues will be unable to vote, so no takeover can occur.

Matthew suggests the following
for a catchup takeover dry run:

  • Require a vote from the primary (or timeout)
  • If the primary votes `no`: fail
  • If the primary votes `yes`: count that vote as +1 and follow the normal majority of votes rule
  • If the primary times out: count that as +0 and follow the normal majority of votes rule

For example, with three voters, where the third is the primary (X is any vote):

X, X, No: fail

No, No, Yes: fail

Yes, No, Yes: succeed

Yes, No, Timeout: fail

Yes, Yes, Timeout: succeed

This would allow the catchup takeover dry-run to succeed even if the primary has disk issues, unblocking SERVER-51100.
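
The proposed rule can be sketched in a few lines of Python. This is only an illustration of the vote-counting logic described above, not the server's C++ implementation in VoteRequester::Algorithm. The names `Vote` and `catchup_takeover_dry_run_succeeds` are hypothetical, and the sketch assumes that the vote list covers every voting member and that a timeout from any member counts as +0.

```python
from enum import Enum


class Vote(Enum):
    """Possible outcomes of a vote request (hypothetical representation)."""
    YES = "yes"
    NO = "no"
    TIMEOUT = "timeout"


def catchup_takeover_dry_run_succeeds(votes, primary_index):
    """Return True if the dry run passes under the proposed rule.

    votes: one Vote per voting member of the replica set.
    primary_index: position of the current primary in `votes`.
    """
    # An explicit `no` from the primary fails the dry run outright.
    if votes[primary_index] is Vote.NO:
        return False
    # A `yes` counts as +1; a timeout counts as +0 (assumed for all members).
    yes_votes = sum(1 for v in votes if v is Vote.YES)
    # Normal rule: a strict majority of all voters must vote yes.
    return yes_votes > len(votes) // 2


# The examples from this ticket, with the primary as the third voter.
Y, N, T = Vote.YES, Vote.NO, Vote.TIMEOUT
results = {
    "No, No, Yes": catchup_takeover_dry_run_succeeds([N, N, Y], 2),
    "Yes, No, Yes": catchup_takeover_dry_run_succeeds([Y, N, Y], 2),
    "Yes, No, Timeout": catchup_takeover_dry_run_succeeds([Y, N, T], 2),
    "Yes, Yes, Timeout": catchup_takeover_dry_run_succeeds([Y, Y, T], 2),
}
```

The only difference from the current behaviour, as described in this ticket, is that a primary timeout no longer fails the dry run by itself; it simply contributes no vote.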

 

