[SERVER-35766] Replication commands sent in candidate's new term can interrupt concurrent vote request
Created: 22/Jun/18  Updated: 29/Oct/23  Resolved: 03/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.7, 4.0.2, 4.1.2

Type: Bug
Priority: Major - P3
Reporter: William Schultz (Inactive)
Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-34682 Old primary should vote yes and store... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0, v3.6
Sprint: Repl 2018-07-16, Repl 2018-07-30, Repl 2018-08-13
Participants:
Linked BF Score: 17

 Description   

When a candidate runs for election in a new term, it may seek a vote from a current primary that is still in an older term. If the candidate wins the dry-run election, it increments its local term and then sends out a vote request in the new term. This vote request can end up running concurrently with another command sent from the same candidate, for example an updatePosition request. We currently don't appear to pause updatePosition requests when we become a candidate, so a poorly timed updatePosition request sent after we become a candidate may reach the old primary and, because it carries the higher term, cause it to step down. If that step down happens while the vote request is running, it may interrupt the vote request and make it fail, and the candidate may lose the election if it required that node's vote. The candidate may then need to wait one more election timeout before it tries to get elected again.
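To make the interleaving concrete, here is a toy Python model of the race. It is only a sketch under simplifying assumptions, not the server's actual code: the `Node` class, `observe_term`, and `run_vote_request` are hypothetical names, and "concurrency" is modeled as a list of terms carried by other commands that arrive while the vote request is in flight.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for the old primary's replication state."""
    name: str
    term: int
    is_primary: bool = False

    def observe_term(self, incoming_term: int) -> bool:
        """Adopt a higher term; return True if that forced a step down."""
        if incoming_term > self.term:
            self.term = incoming_term
            if self.is_primary:
                self.is_primary = False
                return True
        return False


def run_vote_request(old_primary: Node, candidate_term: int,
                     concurrent_command_terms: list) -> bool:
    """Return True if the vote request completes, False if it is interrupted.

    concurrent_command_terms holds the terms carried by other commands
    (e.g. updatePosition) that arrive while the vote request is running.
    """
    interrupted = False
    for term in concurrent_command_terms:
        # A command with a higher term makes the primary step down, and the
        # step down interrupts operations that are currently in progress.
        if old_primary.observe_term(term):
            interrupted = True
    if interrupted:
        return False  # the vote request fails with an interruption error
    # No concurrent step down: the vote request itself carries the new term.
    old_primary.observe_term(candidate_term)
    return True


quiet = Node("old_primary", term=1, is_primary=True)
print(run_vote_request(quiet, 2, []))    # vote request completes

raced = Node("old_primary", term=1, is_primary=True)
print(run_vote_request(raced, 2, [2]))   # updatePosition in term 2 interrupts it
```

In the second call the old primary still ends up in term 2 and no longer primary, but the candidate never receives the vote it was waiting for.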

This issue appears to arise because we check for generic interruptions in the common command processing code path. Perhaps we want to make these interrupt checks immune to certain interruption types, e.g. InterruptedDueToReplStepDown.



 Comments   
Comment by Githook User [ 07/Aug/18 ]

Author: Vesselina Ratcheva (vessy-mongodb) <vesselina.ratcheva@10gen.com>

Message: SERVER-35766 Prevent stepdowns from interrupting database commands

(cherry picked from commit 0a8dcd7e6bb85b91eca0d06cd987f8c76cbebd0b)
Branch: v3.6
https://github.com/mongodb/mongo/commit/dff731c8d516a3d4fd57b411ba0600ba51b5b800

Comment by Githook User [ 06/Aug/18 ]

Author: Vesselina Ratcheva (vessy-mongodb) <vesselina.ratcheva@10gen.com>

Message: SERVER-35766 Prevent stepdowns from interrupting database commands

(cherry picked from commit 0a8dcd7e6bb85b91eca0d06cd987f8c76cbebd0b)
Branch: v4.0
https://github.com/mongodb/mongo/commit/f848cceb5f9ae4e369ad6577987d354361773a96

Comment by Githook User [ 03/Aug/18 ]

Author: Vesselina Ratcheva (vessy-mongodb) <vesselina.ratcheva@10gen.com>

Message: SERVER-35766 Prevent stepdowns from interrupting database commands
Branch: master
https://github.com/mongodb/mongo/commit/0a8dcd7e6bb85b91eca0d06cd987f8c76cbebd0b

Comment by Siyuan Zhou [ 26/Jul/18 ]

geert.bosch made a good point that we can just use checkForInterruptNoAssert and allow InterruptedDueToReplStepDown and PrimarySteppedDown explicitly. Alternatively, I think isNotMasterError() is the right set of errors we can ignore for interruption.
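The idea can be sketched in Python roughly as follows. This is an illustration only and not the committed fix: `OperationContext`, `mark_killed`, `check_for_interrupt_no_assert`, and `check_for_interrupt_ignoring_stepdown` are hypothetical stand-ins, error codes are represented by their names as strings, and the ignored set contains only the two codes named above. The isNotMasterError() alternative would swap in a broader set, which is not enumerated here.

```python
from typing import Optional

# Interruption codes the common command path would tolerate (assumption:
# only the two codes named in this comment).
IGNORED_STEPDOWN_CODES = frozenset({
    "InterruptedDueToReplStepDown",
    "PrimarySteppedDown",
})


class OperationInterrupted(Exception):
    """Raised when an operation was killed for a reason we must honor."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


class OperationContext:
    """Hypothetical minimal operation context that records a kill code."""

    def __init__(self) -> None:
        self._kill_code: Optional[str] = None

    def mark_killed(self, code: str) -> None:
        # Keep the first kill reason; later ones do not overwrite it.
        if self._kill_code is None:
            self._kill_code = code

    def check_for_interrupt_no_assert(self) -> Optional[str]:
        """Return the kill code instead of raising, or None if not killed."""
        return self._kill_code


def check_for_interrupt_ignoring_stepdown(op_ctx: OperationContext) -> None:
    """Raise for any interruption except the explicitly ignored codes."""
    code = op_ctx.check_for_interrupt_no_assert()
    if code is not None and code not in IGNORED_STEPDOWN_CODES:
        raise OperationInterrupted(code)


ctx = OperationContext()
ctx.mark_killed("PrimarySteppedDown")
check_for_interrupt_ignoring_stepdown(ctx)  # does not raise; command continues
```

Using the non-raising check keeps the decision in one place: the caller inspects the code and decides which interruptions are fatal for this code path.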

On master, it seems we only use PrimarySteppedDown. william.schultz, do you think we need to backport this ticket? Election handoff will be backported to 4.0 and 3.6.

Comment by Gregory McKeon (Inactive) [ 26/Jun/18 ]

A possible workaround is to retry requestVotes on the candidate node.
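A rough Python sketch of what such a retry could look like, under stated assumptions: `request_vote_with_retry` and `send_vote_request` are hypothetical names, the reply is assumed to be a dict with `ok` and `codeName` fields, and only step-down interruptions are treated as retryable. None of this is taken from the server code.

```python
# Failure codes that indicate the vote request was interrupted by a step down
# on the remote node rather than rejected (assumption for this sketch).
RETRYABLE_CODE_NAMES = frozenset({
    "InterruptedDueToReplStepDown",
    "PrimarySteppedDown",
})


def request_vote_with_retry(send_vote_request, max_attempts: int = 2) -> dict:
    """Call send_vote_request, retrying only after a step-down interruption.

    send_vote_request is a zero-argument callable returning a reply dict.
    The last reply is returned, whether it succeeded or not.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    reply: dict = {}
    for _ in range(max_attempts):
        reply = send_vote_request()
        succeeded = reply.get("ok") == 1
        retryable = reply.get("codeName") in RETRYABLE_CODE_NAMES
        if succeeded or not retryable:
            return reply
    return reply
```

Because the old primary has already stepped down and adopted the new term by the time of the retry, a second attempt should no longer be interrupted for the same reason, so a small attempt limit seems sufficient.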

Generated at Thu Feb 08 04:40:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.