[SERVER-30727] Secondary stuck in loop of current election Created: 18/Aug/17 Updated: 21/May/20 Resolved: 16/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Shashank Sahni | Assignee: | William Schultz (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Hi, this was witnessed twice, with the events happening in the following order.
4. A is started. It performs the election and becomes primary. Relevant logs from Node A for steps 1-4
Relevant logs from Node B for steps 1-4
Note that, although nodeB didn't log any voteRequester messages, it did increase the election term from 7 to 8. This can be confirmed by the term number nodeA used for its dry-run election, 8. Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for vote requests after starting a real election. I couldn't find any information related to this. |
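For anyone trying to confirm the same symptom, a minimal shell sketch of how the term bump could be observed from outside the nodes (the host names are placeholders, not the reporter's actual members):

```javascript
// Hypothetical hosts; substitute the actual replica set members.
var hosts = ["nodeA.example.net:27017", "nodeB.example.net:27017"];

hosts.forEach(function (host) {
    var conn = new Mongo(host);
    var status = conn.getDB("admin").runCommand({ replSetGetStatus: 1 });
    // "term" is the current election term; a secondary (myState 2) whose term
    // keeps climbing with no primary appearing matches the loop described above.
    print(host, "state:", status.myState, "term:", status.term);
});
```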
| Comments |
| Comment by Matthew Russotto [ 11/Jul/19 ] |
|
This makes sense; WiredTigerRecordStore::reclaimOplog never releases the IX lock on "local" as long as there is work to do. |
| Comment by David Bartley [ 22/Apr/19 ] |
|
I think I figured out what happened. These particular replica sets were being run on EBS volumes created from EBS snapshots, and the oplog was also truncated (via replSetResizeOplog). The latter would have caused the background oplog thread to start pruning oplog entries. Obviously, that process takes a lock (IX) on "local", and it appears that reading local.system.replset also happens relatively frequently, which likewise takes a lock (IX) on "local". With pv1, every time an election is called, the "local.replset.election" collection is written, which takes an X lock on "local". I believe that latter operation was being blocked by all the other IX locks on "local", and by the time the X lock was acquired, the election timers had expired. I suspect that if I had waited long enough, the background thread might have finished chewing through the oplog. |
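To make the moving parts concrete, a minimal shell sketch under the same assumptions (the 16000 MB size is only a placeholder, and replSetResizeOplog availability depends on the server build in use):

```javascript
// Resizing the oplog (size is in MB). Shrinking it below its current size makes
// the background thread start reclaiming the excess entries, which repeatedly
// takes IX locks on the "local" database.
db.adminCommand({ replSetResizeOplog: 1, size: 16000 });

// The two "local" collections described above, read/written around elections.
var local = db.getSiblingDB("local");
local.getCollection("system.replset").findOne();   // replset config, read frequently (IX on "local")
local.getCollection("replset.election").findOne(); // last-vote document, written on each pv1 election
```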
| Comment by David Bartley [ 21/Apr/19 ] |
|
Oh, and the situation we saw this in was a 3-node replset where at least 1 member was down initially. However, even when all members eventually come back up, they're still unable to elect a primary because it looks like 2 out of 3 nodes end up in the wedged state. |
| Comment by David Bartley [ 21/Apr/19 ] |
|
We recently ran into this issue (on 3.4.20). I think the deadlock theory is on the right track, since running currentOp indicates that "replExecDBWorker-1" is stuck waiting for a lock while holding an X lock on "local"; calling "show collections" and "show dbs" also hangs, presumably because they can't get an IX lock on "local". I couldn't figure out what was actually holding the lock, but I can try attaching gdb next time and dumping all thread stacks. I might be misreading currentOp, but I think it's the case that if "waitingForLock" is true, there isn't any indication of which lock an operation is waiting on? i.e. the "locks" document only contains locks held, and not those being waited on? It'd probably be good to surface the lock being waited on as well. |
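For reference, a sketch of the kind of currentOp inspection described above; nothing here is specific to this ticket, and as noted the "locks" field only reports locks already held:

```javascript
// Operations currently blocked waiting on a lock, with whatever locks they hold.
db.currentOp({ waitingForLock: true }).inprog.forEach(function (op) {
    print(op.desc, "held:", JSON.stringify(op.locks));
});

// Include idle/system operations to spot the replication worker thread that is
// holding the X lock on "local" (e.g. "replExecDBWorker-1" in this report).
db.currentOp(true).inprog.filter(function (op) {
    return /replExecDBWorker/.test(op.desc || "");
}).forEach(function (op) {
    printjson({ desc: op.desc, waitingForLock: op.waitingForLock, locks: op.locks });
});
```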
| Comment by William Schultz (Inactive) [ 16/Jan/18 ] |
|
Closing this for now, since I was unable to reproduce the issue. If it occurs again, we can possibly re-open it and investigate further. |
| Comment by William Schultz (Inactive) [ 18/Dec/17 ] |
|
My suspicion is that there may be a deadlock on node5-2 after the dry run election succeeds, causing the thread that is going to start the real election to hang. Election timeouts keep firing and being serviced by the ReplicationExecutor, causing us to see the "Not running for election" messages repeatedly, but the election thread is hung, so we indefinitely remain a candidate. I have not figured out the root cause yet, but it may be related to how the ReplicationCoordinatorExternalStateImpl::storeLocalLastVoteDocument function takes locks. |
| Comment by William Schultz (Inactive) [ 18/Dec/17 ] |
|
shredder12, is this issue easily reproducible, or has it only been seen rarely? |
| Comment by William Schultz (Inactive) [ 14/Sep/17 ] |
|
I did look into this briefly, but have not come up with a conclusive diagnosis yet. |
| Comment by Shashank Sahni [ 18/Aug/17 ] |
|
Hi Thomas, thank you for the quick response. Please find attached the logs for the replica set nodes during the incident. node5-1.log
|
| Comment by Kelsey Schubert [ 18/Aug/17 ] |
|
Hi shredder12, Thanks for reporting this behavior. So we can better understand what is going on here, would you please provide the complete log files for each node in the replica set as well as the replica set configuration, e.g. the output of rs.conf()? Thank you, |
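For completeness, the requested configuration and status can be gathered from the shell on any member along with its log files (a minimal sketch, assuming shell access to each node):

```javascript
// Replica set configuration and current status, as requested above.
printjson(rs.conf());
printjson(rs.status());
```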