Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major - P3
-
None
-
3.4.3
-
None
-
ALL
-
(copied to CRM)
Description
Hi,
This was witnessed twice where the events happened in the following order.
1. Primary(A) is down.
2. Secondary(B) detects and starts an election.
3. It wins the dry-run election and on every subsequent election timeout(10s) prints the following. This goes on until its restarted.
[ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
|
4. A is started. It performs the election and is primary.
5. B is restarted. On one instance it took a long time to shutdown and in another had to be SIGKILLed.
6. A is shutdown.
7. B detects, starts elections and becomes primary.
Relevant logs from Node A for step 1-4
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
|
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
|
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 }
|
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election
|
2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 }
|
2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 9
|
Relevant logs from Node B for step 1-4
2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
|
2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
|
2017-08-17T05:18:32.068-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused
|
2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 }
|
2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election
|
2017-08-17T05:18:42.441-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
|
2017-08-17T05:18:53.473-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
|
2017-08-17T05:19:04.837-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
|
...
|
Note that, although nodeB didn't log any voteRequester messages, it does increase the election term from 7 to 8. This can be confirmed by the term number used by nodeA for dry-run election, 8.
Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for voteRequests after starting a real election. I couldn't find any information related to this.
Attachments
Issue Links
- is related to
-
SERVER-42055 Only acquire a collection IX lock to write the lastVote document
-
- Closed
-
- related to
-
SERVER-42175 WiredTigerRecordStore::reclaimOplog can block for extended periods of time
-
- Closed
-