[SERVER-11564] replset reconfigs trigger elections sometimes in some cases in which they shouldn't Created: 04/Nov/13  Updated: 10/Dec/14  Resolved: 10/Oct/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6, 2.5.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Dannenberg Assignee: Spencer Brody (Inactive)
Resolution: Duplicate Votes: 1
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-15160 TopologyCoordinatorImpl should not al... Closed
Tested
Operating System: ALL
Steps To Reproduce:

Create a 5-node replica set with members as follows:
1) priority: 3
2) priority: 2
3) priority: 0, hidden: true
4) arbiter
5) arbiter

Perform a reconfig that increases the priority of node 1.
This triggers an unnecessary election much of the time.
If it doesn't, change the priority back, which may trigger one.
If that doesn't either, change it again.
Repeat as needed, which shouldn't take many attempts (it was happening about 4 out of 5 times for me).
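
The configuration in the steps above can be sketched as a plain config document. This is only an illustration: the set name `rs0`, the hostnames, the new priority value of 4, and the helper names `make_initial_config` and `bump_priority` are assumptions, not part of the original report. The field names (`priority`, `hidden`, `arbiterOnly`, `version`) are the standard replica set configuration fields.

```python
import copy


def make_initial_config():
    """Build the 5-member config from the steps (set name and hosts are hypothetical)."""
    return {
        "_id": "rs0",  # assumed replica set name
        "version": 1,
        "members": [
            {"_id": 0, "host": "host1.example:27017", "priority": 3},
            {"_id": 1, "host": "host2.example:27017", "priority": 2},
            {"_id": 2, "host": "host3.example:27017", "priority": 0, "hidden": True},
            {"_id": 3, "host": "host4.example:27017", "arbiterOnly": True},
            {"_id": 4, "host": "host5.example:27017", "arbiterOnly": True},
        ],
    }


def bump_priority(config, member_id, new_priority):
    """Return a copy of config with one member's priority changed and the version incremented."""
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["_id"] == member_id:
            member["priority"] = new_priority
    new_config["version"] += 1
    return new_config


initial = make_initial_config()
# Node 1 in the steps is member _id 0 here; raise its priority from 3 to 4 (assumed value).
reconfigured = bump_priority(initial, member_id=0, new_priority=4)
```

A document like `reconfigured` is what would be handed to `rs.reconfig()` in the shell or to the `replSetReconfig` command.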


 Description   

During a reconfig, the primary relinquishes primary status and then recovers it almost immediately. This is normal and fine.

Sometimes, however, an election occurs and it takes 6-8 seconds before the node resumes being primary. In these problem cases, the primary usually relinquishes with the message "[rsMgr] can't see a majority of the set, relinquishing primary." This message comes from msgCheckNewState, which runs when we receive a heartbeat response.
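
For reference, the majority condition behind that log message can be sketched as simple arithmetic. This is a minimal illustration, assuming each of the five members carries one vote and that the check is a strict majority. The function name `can_see_majority` is hypothetical and is not the server's own code.

```python
def can_see_majority(votes_up: int, total_votes: int) -> bool:
    """Strict majority: more than half of all votes must be visible."""
    return votes_up * 2 > total_votes


TOTAL_VOTES = 5  # assumed: one vote per member in the 5-node set

# Map each possible count of visible votes to whether it counts as a majority.
visibility = {up: can_see_majority(up, TOTAL_VOTES) for up in range(TOTAL_VOTES + 1)}
```

Under these assumptions, a primary that counts only two visible votes (itself plus one other member) fails the check, while three visible votes pass it.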

I believe this happens due to a race in shutting down our ReplSetHealthPollTasks: halting a task only prevents it from running again. This means that if we are waiting on a heartbeat response when we halt the task, we will still process that response before the task truly ends.
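
The suspected race can be illustrated with a toy, single-threaded model. This is a sketch of the idea only, not the server's code: the `PollTask` class, the `network` queue of pending response callbacks, and the log text are all hypothetical stand-ins.

```python
from collections import deque


class PollTask:
    """Toy stand-in for a periodic heartbeat task (not the server's class)."""

    def __init__(self, name, network, log):
        self.name = name
        self.network = network
        self.log = log
        self.halted = False

    def run(self):
        # A halted task never sends another heartbeat...
        if self.halted:
            return
        # ...but a request already in flight keeps its response callback.
        self.network.append(self.on_response)

    def halt(self):
        # Halting only sets a flag; it does not cancel in-flight requests.
        self.halted = True

    def on_response(self):
        # No halted check here, which models the suspected race.
        self.log.append(f"{self.name}: processed heartbeat response")


network = deque()  # pending response callbacks, in arrival order
log = []

old_task = PollTask("old-config task", network, log)
old_task.run()   # heartbeat sent, response still pending
old_task.halt()  # reconfig halts the task
old_task.run()   # halted: no new heartbeat is sent

# Responses arrive after the halt and are still delivered.
while network:
    network.popleft()()
```

In this model, `log` ends up with one entry from the halted task, showing a response being handled after the halt.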

It may make sense to change ReplSetHealthPollTask to be a BackgroundJob rather than a Task, since stopping a BackgroundJob is more robust and we can wait for the job to actually finish. But this seems like a nontrivial change and probably cannot be gracefully backported to 2.4.
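
The stop-and-wait idea can be sketched with a plain thread. This is an analogy under stated assumptions, not the BackgroundJob API: the `HeartbeatJob` class, its method names, and the 10 ms polling interval are hypothetical.

```python
import threading
import time


class HeartbeatJob:
    """Toy job whose stop call returns only after the worker thread has exited."""

    def __init__(self):
        self._stop_requested = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.processed = 0

    def start(self):
        self._thread.start()

    def is_running(self):
        return self._thread.is_alive()

    def _run(self):
        while not self._stop_requested.is_set():
            # Stand-in for "send a heartbeat and wait for the response".
            self._stop_requested.wait(0.01)
            if self._stop_requested.is_set():
                break  # drop the response instead of processing it after a stop
            self.processed += 1

    def stop_and_wait(self):
        # Request the stop, then block until the worker has really finished.
        self._stop_requested.set()
        self._thread.join()


job = HeartbeatJob()
job.start()
time.sleep(0.05)       # let a few simulated heartbeats complete
job.stop_and_wait()    # after this returns, no more responses are processed
count_at_stop = job.processed
```

Because `stop_and_wait` joins the worker thread, no response can be processed once it returns. A halt flag alone does not give that guarantee.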



 Comments   
Comment by Spencer Brody (Inactive) [ 10/Oct/14 ]

Duplicate of SERVER-15160.
