[SERVER-46850] Stepup due to election handoff may not have _voteRequester when it's candidate
Created: 13/Mar/20  Updated: 18/May/20  Resolved: 18/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.17, 4.2.4, 4.0.16, 4.4.0
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Siyuan Zhou
Assignee: Siyuan Zhou
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-48256 Election cannot be canceled when writ... Closed
Operating System: ALL
Sprint: Repl 2020-03-23, Repl 2020-04-20, Repl 2020-05-04, Repl 2020-05-18, Repl 2020-06-01
Participants:
Linked BF Score: 0

 Description   

ReplicationCoordinatorImpl::_cancelElectionIfNeeded_inlock() assumes that a candidate always has a _voteRequester, which isn't true for a stepup command that skips the dry run.

A stepup command issued due to election handoff skips the dry run. After changing its role to kCandidate, the node writes down the vote for itself and releases the mutex. If another thread calls _cancelElectionIfNeeded_inlock() in the window before the node finishes the write and starts _voteRequester, it hits the invariant.
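
The window can be illustrated with a minimal, single-threaded C++ sketch. This is not the server code: `Coordinator`, `beginHandoffStepUp()`, `startVoteRequester()`, `cancelElectionIfNeeded()` and `CancelResult` are hypothetical stand-ins, and the real invariant is replaced by a return value so the three states can be observed without aborting.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for illustration only; not the real MongoDB types.
enum class Role { kFollower, kCandidate, kLeader };

struct VoteRequester {
    bool cancelled = false;
    void cancel() { cancelled = true; }
};

enum class CancelResult { kNotCandidate, kNoVoteRequester, kCancelled };

struct Coordinator {
    Role role = Role::kFollower;
    std::unique_ptr<VoteRequester> voteRequester;  // null until voting starts

    // Handoff stepup skips the dry run: the role changes first, and the
    // vote requester is only created later, after the self-vote is written.
    void beginHandoffStepUp() { role = Role::kCandidate; }

    // Runs once the self-vote write has finished.
    void startVoteRequester() {
        assert(role == Role::kCandidate);
        voteRequester = std::make_unique<VoteRequester>();
    }

    // Null-tolerant variant of the cancel path. Code that instead assumed
    // "candidate implies voteRequester != nullptr" would fail in the window.
    CancelResult cancelElectionIfNeeded() {
        if (role != Role::kCandidate) {
            return CancelResult::kNotCandidate;
        }
        if (!voteRequester) {
            // Inside the window: nothing to cancel, the role stays kCandidate.
            return CancelResult::kNoVoteRequester;
        }
        voteRequester->cancel();
        return CancelResult::kCancelled;
    }
};
```

In this sketch, calling `cancelElectionIfNeeded()` between `beginHandoffStepUp()` and `startVoteRequester()` returns `kNoVoteRequester` and leaves the role as `kCandidate`. Tolerating the null pointer avoids the crash, but the election itself is not cancelled.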



 Comments   
Comment by Siyuan Zhou [ 18/May/20 ]

I don't think SERVER-48256 will be backported to 3.6. Since the original bug in the BF is a rare race between an election and the transition to secondary, I'd propose not backporting the fix to 4.2 and earlier versions until it's reported.

Comment by Siyuan Zhou [ 18/May/20 ]

Even if we skip the vote requester when it's null, the election still isn't cancelled. This issue will be fixed by SERVER-48256. Closing this as a dup.
