[SERVER-30727] Secondary stuck in loop of current election Created: 18/Aug/17  Updated: 21/May/20  Resolved: 16/Jan/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shashank Sahni Assignee: William Schultz (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: node5-1.log, node5-2.log, node5-arb.log
Issue Links:
Related
related to SERVER-42175 WiredTigerRecordStore::reclaimOplog c... Closed
is related to SERVER-42055 Only acquire a collection IX lock to ... Closed
Operating System: ALL

 Description   

Hi,

This was observed twice, with the events happening in the following order.
1. Primary (A) is down.
2. Secondary (B) detects this and starts an election.
3. It wins the dry-run election, and on every subsequent election timeout (10s) it prints the following. This goes on until it is restarted.

[ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate

4. A is started. It runs an election and becomes primary.
5. B is restarted. In one instance it took a long time to shut down, and in another it had to be SIGKILLed.
6. A is shut down.
7. B detects this, starts an election, and becomes primary.

Relevant logs from Node A for steps 1-4

2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:32:08.534-0700 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2017-08-17T05:32:08.621-0700 I REPL     [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:32:08.621-0700 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 9

Relevant logs from Node B for steps 1-4

2017-08-17T05:18:32.066-0700 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-08-17T05:18:32.066-0700 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-08-17T05:18:32.068-0700 I REPL     [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused
2017-08-17T05:18:32.069-0700 I REPL     [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:18:32.069-0700 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2017-08-17T05:18:42.441-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
2017-08-17T05:18:53.473-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
2017-08-17T05:19:04.837-0700 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
...

Note that although nodeB didn't log any VoteRequester messages for the real election, it did increment the election term from 7 to 8. This can be confirmed by the term nodeA later used for its dry-run election, 8.

Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for vote responses after starting a real election. I couldn't find any information related to this.
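For anyone trying to confirm the same state, here is a minimal mongo shell sketch (assuming a direct connection to the stuck secondary; field names are from replSetGetStatus output on 3.4 with protocol version 1):

// Minimal sketch: check the election term and member states reported by
// replSetGetStatus on the stuck secondary.
var status = db.adminCommand({ replSetGetStatus: 1 });

// The term the node is in; in the incident above it had already moved
// from 7 to 8 even though no VoteRequester messages were logged.
print("current term: " + status.term);

// Member states; a node wedged mid-election still reports SECONDARY here.
status.members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr);
});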



 Comments   
Comment by Matthew Russotto [ 11/Jul/19 ]

This makes sense; WiredTigerRecordStore::reclaimOplog never releases the IX lock on "local" as long as there is work to do.

Comment by David Bartley [ 22/Apr/19 ]

I think I figured out what happened. These particular replsets were being run on EBS volumes created from EBS snapshots, and the oplog was also truncated (via replSetResizeOplog). The latter would have caused the background oplog thread to start pruning oplog entries. Obviously, that process takes a lock (IX) on "local", and it appears that reading local.system.replset also happens relatively frequently, which also takes a lock (IX) on "local". With pv1, every time an election is called, the "local.replset.election" collection is written, which takes an X lock on "local". I believe that last operation was being blocked by all the other IX locks on "local", and by the time the X lock was acquired, the election timers had expired. I suspect that if I had waited long enough, the background thread might have finished chewing through the oplog.
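For what it's worth, a small shell sketch of how one might watch whether the background truncation is still working through the oplog (standard shell helpers; note the read may itself block if "local" is locked up as described):

// Sketch: watch oplog truncation progress on the affected member.
// rs.printReplicationInfo() reports the configured oplog size, the space
// actually used, and the time range covered by local.oplog.rs; if the
// background thread is still pruning, the "oplog first event time" keeps
// moving forward on repeated runs.
rs.printReplicationInfo();

// The same data as an object, if you want to poll it in a loop.
var info = db.getSiblingDB("local").getReplicationInfo();
printjson({ usedMB: info.usedMB, tFirst: info.tFirst, tLast: info.tLast });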

Comment by David Bartley [ 21/Apr/19 ]

Oh, and the situation we saw this in was a 3-node replset where at least 1 member was down initially. However, even when all members eventually come back up, they're still unable to elect a primary because it looks like 2 out of 3 nodes end up in the wedged state.

Comment by David Bartley [ 21/Apr/19 ]

We recently ran into this issue (on 3.4.20). I think the deadlock theory is on the right track, since running currentOp indicates that "replExecDBWorker-1" is stuck waiting for a lock while holding an X lock on "local"; calling "show collections" and "show dbs" also hangs, presumably because they can't get an IX lock on "local". I couldn't figure out what was actually holding the lock, but I can try attaching gdb next time and dumping all thread stacks.

I might be misreading currentOp, but I think that if "waitingForLock" is true, there isn't any indication of which lock the operation is waiting on; i.e. the "locks" document only contains locks held, not those being waited on. It'd probably be good to surface the lock being waited on as well.
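For reference, the kind of filtering I mean (a sketch against 3.4-era currentOp output; it assumes the waitingForLock, locks, and desc fields look the way they do in our deployment):

// Sketch: list operations that are blocked waiting for a lock.
// db.currentOp(true) includes idle and system operations, so threads like
// replExecDBWorker-1 show up here too.
var ops = db.currentOp(true).inprog;

ops.filter(function (op) { return op.waitingForLock; })
   .forEach(function (op) {
       // As noted above, "locks" only shows what the operation already
       // holds; the lock it is waiting on is not broken out separately.
       printjson({ desc: op.desc, opid: op.opid, locks: op.locks });
   });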

Comment by William Schultz (Inactive) [ 16/Jan/18 ]

Closing this for now, since I was unable to reproduce the issue. If it occurs again, we can possibly re-open it and investigate further.

Comment by William Schultz (Inactive) [ 18/Dec/17 ]

My suspicion is that there may be a deadlock on node5-2 after the dry run election succeeds, causing the thread that is going to start the real election to hang. Election timeouts keep firing and being serviced by the ReplicationExecutor, causing us to see the "Not starting an election" messages repeatedly, but the election thread is hung, so we remain a candidate indefinitely. I have not figured out the root cause yet, but it may be related to how the ReplicationCoordinatorExternalStateImpl::storeLocalLastVoteDocument function takes locks.
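If it helps anyone chasing this, a small sketch for inspecting the document that write targets (assuming the pv1 last-vote document lives in local.replset.election):

// Sketch: read the last-vote document that storeLocalLastVoteDocument
// persists (assumed to be local.replset.election under pv1). If that write
// is wedged behind locks on "local", this read may block the same way,
// which is itself a useful signal.
var lastVote = db.getSiblingDB("local").getCollection("replset.election").findOne();
printjson(lastVote); // expected fields: term, candidateIndex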

Comment by William Schultz (Inactive) [ 18/Dec/17 ]

shredder12, is this issue easily reproducible, or has it only been seen rarely?

Comment by William Schultz (Inactive) [ 14/Sep/17 ]

I did look into this briefly, but have not come up with a conclusive diagnosis yet.

Comment by Shashank Sahni [ 18/Aug/17 ]

Hi Thomas,

Thank you for the quick response. Please find attached the logs for the replica set nodes during the incident: node5-1.log, node5-2.log, and node5-arb.log.

  • node5-1, aka nodeA
  • node5-2, aka nodeB
  • node5-arb, the arbiter of the replica set

Comment by Kelsey Schubert [ 18/Aug/17 ]

Hi shredder12,

Thanks for reporting this behavior. So we can better understand what is going on here, would you please provide the complete log files for each node in the replica set as well as the replica set configuration, e.g. the output of rs.conf()?

Thank you,
Thomas
