[SERVER-37255] replSetReconfig with concurrent election can trigger invariant
Created: 21/Sep/18  Updated: 29/Oct/23  Resolved: 08/Jan/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.11, 4.0.7, 4.1.6
Fix Version/s: 3.6.12, 4.0.8, 4.1.7

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Backport Requested: v4.0, v3.6, v3.4
Sprint: Repl 2018-10-22, Repl 2018-11-05, Repl 2019-01-14
Linked BF Score: 52

 Description   

If a replSetReconfig runs on a node that is concurrently processing a successful election win, it can trigger an invariant in ReplicationCoordinatorImpl::_cancelElectionIfNeeded_inlock. The ReplicationCoordinatorImpl::_onVoteRequestComplete method is called when the VoteRequester completes. On a successful election, we log the result and then proceed to process the election win: we reset the VoteRequester and then update our member state to reflect our transition to leader. Before we call _performPostMemberStateUpdateAction, though, we unlock the ReplicationCoordinator mutex. This allows a concurrent reconfig command, currently running ReplicationCoordinatorImpl::_finishReplSetReconfig, to try to cancel the election before we have transitioned to the Leader role. It calls ReplicationCoordinatorImpl::_cancelElectionIfNeeded_inlock after the _voteRequester has been reset but while we are still in the Candidate role. The function therefore does not take its early return, and it hits the subsequent invariant, which fails because the VoteRequester was already destroyed.
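The interleaving described above can be replayed deterministically in a toy model. The sketch below is only an illustration: RacyCoordinator, electionWinLockedPart, electionWinPostAction, and cancelElectionIfNeeded are hypothetical stand-ins, not the real ReplicationCoordinatorImpl code, and the invariant is modeled as a thrown exception instead of a process abort.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical, simplified stand-ins for illustration only.
enum class Role { kFollower, kCandidate, kLeader };

struct VoteRequester {};

struct RacyCoordinator {
    Role role = Role::kCandidate;
    std::unique_ptr<VoteRequester> voteRequester = std::make_unique<VoteRequester>();

    // Models the part of the election-win path that runs while the mutex is
    // held: the VoteRequester is destroyed, but the role is still Candidate.
    void electionWinLockedPart() {
        voteRequester.reset();
    }

    // Models the post-update action that runs after the mutex was released:
    // only here does the node leave the Candidate role.
    void electionWinPostAction() {
        assert(!voteRequester);  // the locked part must have run first
        role = Role::kLeader;
    }

    // Models the cancel call made by the reconfig path. Returns false on the
    // early return, true if an election was cancelled, and throws where the
    // real code would hit its invariant.
    bool cancelElectionIfNeeded() {
        if (role != Role::kCandidate) {
            return false;  // early return: no election in progress
        }
        if (!voteRequester) {
            throw std::logic_error("invariant failure: Candidate without VoteRequester");
        }
        voteRequester.reset();  // toy version of cancelling the election
        return true;
    }
};
```

Calling electionWinLockedPart() and then cancelElectionIfNeeded() before electionWinPostAction() reproduces the failure in this model: the role check passes because the node is still a Candidate, and the missing VoteRequester trips the modeled invariant. If electionWinPostAction() runs first, the cancel call takes the early return.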



 Comments   
Comment by Githook User [ 22/Mar/19 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: SERVER-37255 Fix invariant when reconfig races with election
Branch: v4.0
https://github.com/mongodb/mongo/commit/d370aed86df0489e8ff7ee308f4faefa54b57c05

Comment by Githook User [ 21/Mar/19 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: SERVER-37255 Fix invariant when reconfig races with election
Branch: v3.6
https://github.com/mongodb/mongo/commit/a1c7c798168f13284a486153dde4b335735cea0a

Comment by A. Jesse Jiryu Davis [ 22/Feb/19 ]

Requesting backport: a build failure was found on 4.0.6 due to this invariant.

Comment by Githook User [ 08/Jan/19 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: SERVER-37255 Fix invariant when reconfig races with election
Branch: master
https://github.com/mongodb/mongo/commit/327a6bd87961eb7d3cd2a4cd90170e868adf2112

Comment by William Schultz (Inactive) [ 21/Sep/18 ]

A general solution would ideally address the atomicity issues caused by the fact that ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction must be called without holding the ReplicationCoordinator mutex. When we want to update internal repl state with that method, we re-acquire the lock after calling the method, but this seems naturally prone to race conditions. Ideally, the mutex wouldn't need to be released and re-acquired if we are only updating internal repl state from inside _performPostMemberStateUpdateAction, i.e. not calling into ReplicationCoordinatorExternalState. Perhaps we could split this method into two: one that explicitly updates internal state and requires the mutex, and one that updates external state and does not.
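One possible shape for that split is sketched below. It is only an illustration of the idea in this comment, not necessarily what the committed fix does; SplitCoordinator and all of its members are hypothetical names, and the external-state call is reduced to a counter.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical names throughout; a sketch of the proposed split only.
class SplitCoordinator {
public:
    // Election-win path: every internal state change happens inside one
    // critical section, so no other thread can observe "Candidate without
    // a VoteRequester".
    void processElectionWin() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _hasVoteRequester = false;
            _updateInternalState_inlock();
        }
        _updateExternalState();  // mutex is not held here
    }

    // Reconfig path: returns true only if an election was actually cancelled.
    bool cancelElectionIfNeeded() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (!_isCandidate) {
            return false;  // early return: nothing to cancel
        }
        assert(_hasVoteRequester);  // stand-in for the invariant
        _hasVoteRequester = false;
        _isCandidate = false;
        return true;
    }

    bool isLeader() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _isLeader;
    }

    int externalNotifications() const {
        return _externalNotifications.load();
    }

private:
    // Internal repl state only; the caller must hold _mutex.
    void _updateInternalState_inlock() {
        _isCandidate = false;
        _isLeader = true;
    }

    // Stand-in for work that calls into external state; must not hold _mutex.
    void _updateExternalState() {
        ++_externalNotifications;
    }

    std::mutex _mutex;
    bool _isCandidate = true;
    bool _hasVoteRequester = true;
    bool _isLeader = false;
    std::atomic<int> _externalNotifications{0};
};
```

In this sketch, a reconfig thread that calls cancelElectionIfNeeded() either runs before processElectionWin() takes the mutex, in which case it cancels a live election, or after it, in which case it sees a non-Candidate node and returns early. The window in which the VoteRequester is gone but the role is still Candidate is never visible outside the critical section.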
