[SERVER-35766] Replication commands sent in candidate's new term can interrupt concurrent vote request Created: 22/Jun/18 Updated: 29/Oct/23 Resolved: 03/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.7, 4.0.2, 4.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | Vesselina Ratcheva (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Repl 2018-07-16, Repl 2018-07-30, Repl 2018-08-13 |
| Participants: |
| Linked BF Score: | 17 |
| Description |
|
When a candidate runs for election in a new term, it may seek a vote from a current primary that is still in an older term. If the candidate wins the dry-run election, it increments its local term and then sends out a vote request in the new term. This vote request can end up running concurrently with another command sent from the same candidate, for example an updatePosition request. We currently don't appear to pause updatePosition requests when we become a candidate, so a poorly timed updatePosition request sent after we become a candidate may reach the old primary and, because it carries a higher term, cause it to step down. If this command triggers a step down while a vote request is running concurrently, the vote request may fail, and the candidate may lose the election if it required that node's vote. The candidate may then need to wait one more election timeout before it tries to get elected again. This issue appears to arise because we check for generic interruptions in the common command processing codepath. Perhaps we want to make these interrupt checks immune to certain interruption types, e.g. InterruptedDueToReplStepDown. |
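The interleaving described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the server's code: the `OldPrimary` and `Operation` classes, the `interleave` hook, and the vote rule are all hypothetical, and real concurrency is replaced by a callback that runs while the vote request is in flight.

```python
class InterruptedDueToReplStepDown(Exception):
    """Stand-in for the interruption type named in the ticket."""


class Operation:
    """Hypothetical in-flight command that can be killed by a step down."""

    def __init__(self, name):
        self.name = name
        self.kill_reason = None

    def check_for_interrupt(self):
        # Models the generic interrupt check in the common command path.
        if self.kill_reason is not None:
            raise self.kill_reason


class OldPrimary:
    """Toy model of a primary that is still in an older term."""

    def __init__(self, term):
        self.term = term
        self.is_primary = True
        self.active_ops = []

    def _step_down_if_newer(self, incoming_term):
        # Seeing a higher term makes the primary step down, and the
        # step down interrupts every operation that is in flight.
        if incoming_term > self.term:
            self.term = incoming_term
            if self.is_primary:
                self.is_primary = False
                for op in self.active_ops:
                    op.kill_reason = InterruptedDueToReplStepDown(op.name)

    def update_position(self, incoming_term):
        self._step_down_if_newer(incoming_term)
        return {"ok": 1}

    def request_votes(self, incoming_term, interleave=None):
        op = Operation("requestVotes")
        self.active_ops.append(op)
        try:
            if interleave is not None:
                interleave()  # another command runs while this one is in flight
            op.check_for_interrupt()  # fails if a step down happened meanwhile
            # Simplified vote rule (assumption): grant if the term is not stale.
            granted = incoming_term >= self.term
            self._step_down_if_newer(incoming_term)
            return {"ok": 1, "voteGranted": granted}
        finally:
            self.active_ops.remove(op)
```

Without an interleaved command, `OldPrimary(term=1).request_votes(2)` grants the vote in this model. Passing `interleave=lambda: node.update_position(2)` makes the same call raise `InterruptedDueToReplStepDown`, which is the lost vote the description is concerned with.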
| Comments |
| Comment by Githook User [ 07/Aug/18 ] |
|
Author: {'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'} Message: (cherry picked from commit 0a8dcd7e6bb85b91eca0d06cd987f8c76cbebd0b) |
| Comment by Githook User [ 06/Aug/18 ] |
|
Author: {'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'} Message: (cherry picked from commit 0a8dcd7e6bb85b91eca0d06cd987f8c76cbebd0b) |
| Comment by Githook User [ 03/Aug/18 ] |
|
Author: {'username': 'vessy-mongodb', 'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com'} Message: |
| Comment by Siyuan Zhou [ 26/Jul/18 ] |
|
geert.bosch made a good point that we can just use checkForInterruptNoAssert and allow InterruptedDueToReplStepDown and PrimarySteppedDown explicitly. Alternatively, I think isNotMasterError() is the right set of errors we can ignore for interruption. On master, it seems we only use PrimarySteppedDown. william.schultz, do you think we need to backport this ticket? Election handoff will be backported to 4.0 and 3.6. |
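The idea in this comment, a non-throwing interrupt check whose result is filtered against an explicit allow-list, might look roughly like the sketch below. It is a Python model only: `Operation`, `check_for_interrupt_no_assert`, `check_for_interrupt_ignoring`, and `OperationInterrupted` are hypothetical names chosen to mirror the comment, not the server's API, and the code names are plain strings used for illustration.

```python
# Interruption codes the comment proposes to tolerate explicitly.
STEP_DOWN_CODES = frozenset({"InterruptedDueToReplStepDown", "PrimarySteppedDown"})


class OperationInterrupted(Exception):
    """Raised when an operation was killed for a reason we do not ignore."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


class Operation:
    """Hypothetical operation context holding at most one kill code."""

    def __init__(self):
        self.kill_code = None

    def mark_killed(self, code):
        # Keep the first kill reason, later ones do not overwrite it.
        if self.kill_code is None:
            self.kill_code = code

    def check_for_interrupt_no_assert(self):
        # Non-throwing variant: report the kill code (or None) to the caller.
        return self.kill_code


def check_for_interrupt_ignoring(op, ignorable=STEP_DOWN_CODES):
    """Raise only for interruptions that are not on the allow-list."""
    code = op.check_for_interrupt_no_assert()
    if code is not None and code not in ignorable:
        raise OperationInterrupted(code)
```

In this model an operation killed with `"InterruptedDueToReplStepDown"` passes the check and continues, while one killed with any other code (for example `"InterruptedAtShutdown"`) still raises. Using a broader predicate such as the isNotMasterError() set mentioned above would only change the contents of `ignorable`.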
| Comment by Gregory McKeon (Inactive) [ 26/Jun/18 ] |
|
A possible workaround is to retry requestVotes on the candidate node. |
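A rough sketch of that workaround, as a Python model: the candidate resends the vote request when the previous attempt was interrupted by a step down. The function name, the `send_request_votes` callable, and the `max_attempts` parameter are assumptions made for illustration, and the ticket does not say how many retries would be appropriate.

```python
class InterruptedDueToReplStepDown(Exception):
    """Stand-in for the interruption type named in the ticket."""


def request_votes_with_retry(send_request_votes, max_attempts=2):
    """Call send_request_votes(), retrying if a step down interrupted it.

    send_request_votes is a hypothetical zero-argument callable that returns
    the vote response or raises InterruptedDueToReplStepDown.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return send_request_votes()
        except InterruptedDueToReplStepDown as exc:
            # The target stepped down mid-request; it may answer on a resend.
            last_error = exc
    raise last_error
```

Only the step-down interruption is retried here; any other failure propagates immediately, so the retry does not mask unrelated errors.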