Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Cannot Reproduce
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.4.3
Component/s: Replication
Labels:
None

Operating System:
ALL
Case:
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Hi,

This was witnessed twice where the events happened in the following order.
1. Primary(A) is down.
2. Secondary(B) detects and starts an election.
3. It wins the dry-run election and on every subsequent election timeout(10s) prints the following. This goes on until its restarted.

[ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate

4. A is started. It performs the election and is primary.
5. B is restarted. On one instance it took a long time to shutdown and in another had to be SIGKILLed.
6. A is shutdown.
7. B detects, starts elections and becomes primary.

Relevant logs from Node A for step 1-4

2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2017-08-17T05:32:08.621-0700 I REPL     [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:32:08.621-0700 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 9

Relevant logs from Node B for step 1-4

2017-08-17T05:18:32.066-0700 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-08-17T05:18:32.066-0700 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-08-17T05:18:32.068-0700 I REPL     [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused
2017-08-17T05:18:32.069-0700 I REPL     [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:18:32.069-0700 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2017-08-17T05:18:42.441-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
2017-08-17T05:18:53.473-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
2017-08-17T05:19:04.837-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
...

Note that, although nodeB didn't log any voteRequester messages, it does increase the election term from 7 to 8. This can be confirmed by the term number used by nodeA for dry-run election, 8.

Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for voteRequests after starting a real election. I couldn't find any information related to this.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

node5-1.log
Aug 18 2017 09:25:50 PM UTC
3.36 MB
Shashank Sahni
node5-2.log
Aug 18 2017 09:25:49 PM UTC
1.24 MB
Shashank Sahni
node5-arb.log
Aug 18 2017 09:25:48 PM UTC
370 kB
Shashank Sahni

is related to

SERVER-42055 Only acquire a collection IX lock to write the lastVote document

Closed

related to

SERVER-42175 WiredTigerRecordStore::reclaimOplog can block for extended periods of time

Closed

Assignee:: Will Schultz
Reporter:: Shashank Sahni
Participants:: David Bartley, Kelsey Schubert, Matthew Russotto, Shashank Sahni, Will Schultz
Votes:: 0 Vote for this issue
Watchers:: 19 Start watching this issue

Created:: Aug 18 2017 12:04:24 AM UTC
Updated:: May 21 2020 11:02:49 PM UTC
Resolved:: Jan 16 2018 03:56:34 PM UTC
Confidence Status Last Update:: 14/Dec/17 5:52 PM

Details

Description

Attachments

Attachments

Issue Links

Forms

Activity

People

Dates