-
Type:
Bug
-
Resolution: Cannot Reproduce
-
Priority:
Major - P3
-
None
-
Affects Version/s: 3.4.3
-
Component/s: Replication
-
None
-
ALL
-
(copied to CRM)
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Hi,
This was witnessed twice where the events happened in the following order.
1. Primary(A) is down.
2. Secondary(B) detects and starts an election.
3. It wins the dry-run election and on every subsequent election timeout(10s) prints the following. This goes on until its restarted.
[ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
4. A is started. It performs the election and is primary.
5. B is restarted. On one instance it took a long time to shutdown and in another had to be SIGKILLed.
6. A is shutdown.
7. B detects, starts elections and becomes primary.
Relevant logs from Node A for step 1-4
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 } 2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election 2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 } 2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 9
Relevant logs from Node B for step 1-4
2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 2017-08-17T05:18:32.068-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused 2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 } 2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election 2017-08-17T05:18:42.441-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate 2017-08-17T05:18:53.473-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate 2017-08-17T05:19:04.837-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate ...
Note that, although nodeB didn't log any voteRequester messages, it does increase the election term from 7 to 8. This can be confirmed by the term number used by nodeA for dry-run election, 8.
Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for voteRequests after starting a real election. I couldn't find any information related to this.
- is related to
-
SERVER-42055 Only acquire a collection IX lock to write the lastVote document
-
- Closed
-
- related to
-
SERVER-42175 WiredTigerRecordStore::reclaimOplog can block for extended periods of time
-
- Closed
-