Core Server / SERVER-35766

Replication commands sent in candidate's new term can interrupt concurrent vote request

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 3.6.7, 4.0.2, 4.1.2
    • Affects Version/s: None
    • Component/s: Replication
    • None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v4.0, v3.6
    • Sprint: Repl 2018-07-16, Repl 2018-07-30, Repl 2018-08-13
    • 17

      When a candidate runs for election in a new term, it may seek a vote from the current primary, which is still in an older term. If the candidate wins the dry-run election, it increments its local term and then sends out a vote request in the new term. This vote request can end up running concurrently with another command sent by the same candidate, for example an updatePosition request. We currently don't appear to pause updatePosition requests when we become a candidate, so a poorly timed updatePosition request sent after that point may reach the old primary and, because it carries the higher term, cause the primary to step down. If that step down is triggered while the vote request is running, the vote request may fail, and the candidate may lose the election if it required that node's vote. The candidate may then need to wait one more election timeout before it tries to get elected again.
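
      The interleaving described above can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions, not the server's implementation: ToyPrimary, Operation, observe_term, and run_election_round are hypothetical names invented for illustration, and the model assumes that a step down marks every in-flight operation on the node as interrupted.

```python
from dataclasses import dataclass, field


class InterruptedDueToReplStepDown(Exception):
    """Toy stand-in for the interruption type named in this ticket."""


@dataclass
class Operation:
    name: str
    killed: bool = False

    def check_for_interrupt(self):
        # Generic check: any kill, whatever its cause, aborts the operation.
        if self.killed:
            raise InterruptedDueToReplStepDown(self.name)


@dataclass
class ToyPrimary:
    # Hypothetical model of the old primary; not real server code.
    term: int
    is_primary: bool = True
    in_flight: list = field(default_factory=list)

    def begin(self, name):
        op = Operation(name)
        self.in_flight.append(op)
        return op

    def observe_term(self, term):
        # Seeing a higher term makes a primary step down. Assumption:
        # stepping down interrupts every operation currently in flight.
        if term > self.term:
            self.term = term
            if self.is_primary:
                self.is_primary = False
                for op in self.in_flight:
                    op.killed = True

    def finish_vote_request(self, op, candidate_term):
        op.check_for_interrupt()  # the check that the step down trips
        self.observe_term(candidate_term)
        return candidate_term >= self.term  # vote granted


def run_election_round(send_update_position):
    primary = ToyPrimary(term=1)
    new_term = primary.term + 1  # candidate won the dry run, bumps its term
    vote_op = primary.begin("replSetRequestVotes")
    if send_update_position:
        # An updatePosition carrying the new term is processed first.
        primary.observe_term(new_term)
    try:
        return primary.finish_vote_request(vote_op, new_term)
    except InterruptedDueToReplStepDown:
        return False  # the vote is lost; the candidate may have to retry
```

      In this model the vote is granted when no updatePosition request interleaves, and lost when one carrying the new term is processed while the vote request is still in flight.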

      This issue appears to arise because we check for generic interruptions in the common command processing codepath. Perhaps we want to make these interrupt checks immune to certain interruption types, e.g. InterruptedDueToReplStepDown.
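
      One way to picture the suggestion is an interrupt check that accepts a set of interruption types to ignore. The sketch below is an illustration only and is not necessarily the fix that was committed: KillReason, OperationInterrupted, and the immune_to parameter are hypothetical names, and the real check lives in the server's C++ codebase.

```python
from enum import Enum, auto


class KillReason(Enum):
    # Hypothetical subset of interruption types, for illustration only.
    NONE = auto()
    INTERRUPTED = auto()
    INTERRUPTED_DUE_TO_REPL_STEP_DOWN = auto()


class OperationInterrupted(Exception):
    def __init__(self, reason):
        super().__init__(reason.name)
        self.reason = reason


def check_for_interrupt(kill_reason, immune_to=frozenset()):
    """Raise unless the operation is not killed or is immune to this reason."""
    if kill_reason is KillReason.NONE or kill_reason in immune_to:
        return
    raise OperationInterrupted(kill_reason)
```

      Under this sketch, a vote request handler could pass immune_to={KillReason.INTERRUPTED_DUE_TO_REPL_STEP_DOWN} so that a concurrent step down does not abort it, while other interruption types would still be honored.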

            Assignee:
            vesselina.ratcheva@mongodb.com Vesselina Ratcheva (Inactive)
            Reporter:
            william.schultz@mongodb.com William Schultz
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved: