Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-30727

Secondary stuck in loop of current election

    • Type: Icon: Bug Bug
    • Resolution: Cannot Reproduce
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.4.3
    • Component/s: Replication
    • None
    • ALL

      Hi,

      This was witnessed twice where the events happened in the following order.
      1. Primary(A) is down.
      2. Secondary(B) detects and starts an election.
      3. It wins the dry-run election and on every subsequent election timeout(10s) prints the following. This goes on until its restarted.

      [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
      

      4. A is started. It performs the election and is primary.
      5. B is restarted. On one instance it took a long time to shutdown and in another had to be SIGKILLed.
      6. A is shutdown.
      7. B detects, starts elections and becomes primary.

      Relevant logs from Node A for step 1-4

      2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 }
      2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2017-08-17T05:32:08.621-0700 I REPL     [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 }
      2017-08-17T05:32:08.621-0700 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 9
      

      Relevant logs from Node B for step 1-4

      2017-08-17T05:18:32.066-0700 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2017-08-17T05:18:32.066-0700 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2017-08-17T05:18:32.068-0700 I REPL     [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused
      2017-08-17T05:18:32.069-0700 I REPL     [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 }
      2017-08-17T05:18:32.069-0700 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2017-08-17T05:18:42.441-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
      2017-08-17T05:18:53.473-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
      2017-08-17T05:19:04.837-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
      ...
      

      Note that, although nodeB didn't log any voteRequester messages, it does increase the election term from 7 to 8. This can be confirmed by the term number used by nodeA for dry-run election, 8.

      Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for voteRequests after starting a real election. I couldn't find any information related to this.

        1. node5-1.log
          3.36 MB
        2. node5-2.log
          1.24 MB
        3. node5-arb.log
          370 kB

            Assignee:
            william.schultz@mongodb.com Will Schultz
            Reporter:
            shredder12 Shashank Sahni
            Votes:
            0 Vote for this issue
            Watchers:
            19 Start watching this issue

              Created:
              Updated:
              Resolved: