-
Type: Bug
-
Resolution: Cannot Reproduce
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.4.3
-
Component/s: Replication
-
None
-
ALL
-
(copied to CRM)
Hi,
This was witnessed twice where the events happened in the following order.
1. Primary(A) is down.
2. Secondary(B) detects and starts an election.
3. It wins the dry-run election and on every subsequent election timeout(10s) prints the following. This goes on until its restarted.
[ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
4. A is started. It performs the election and is primary.
5. B is restarted. On one instance it took a long time to shutdown and in another had to be SIGKILLed.
6. A is shutdown.
7. B detects, starts elections and becomes primary.
Relevant logs from Node A for step 1-4
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 } 2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election 2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 } 2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 9
Relevant logs from Node B for step 1-4
2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 2017-08-17T05:18:32.068-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused 2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 } 2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election 2017-08-17T05:18:42.441-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate 2017-08-17T05:18:53.473-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate 2017-08-17T05:19:04.837-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate ...
Note that, although nodeB didn't log any voteRequester messages, it does increase the election term from 7 to 8. This can be confirmed by the term number used by nodeA for dry-run election, 8.
Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for voteRequests after starting a real election. I couldn't find any information related to this.
- is related to
-
SERVER-42055 Only acquire a collection IX lock to write the lastVote document
- Closed
- related to
-
SERVER-42175 WiredTigerRecordStore::reclaimOplog can block for extended periods of time
- Closed