-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 2.4.6, 2.5.3
-
Component/s: Replication
-
ALL
-
During a reconfig, the primary relinquishes its role and then recovers it almost immediately. This is normal and fine.
Sometimes, however, an election occurs and it takes 6-8 seconds before the node resumes being primary. In these problem cases, the primary usually relinquishes because "[rsMgr] can't see a majority of the set, relinquishing primary." This message comes from msgCheckNewState, which runs when we receive a heartbeat response.
I believe this happens because of a race in shutting down our ReplSetHealthPollTasks: halting a task only prevents it from running again. If we are waiting on a heartbeat response when we halt the task, we will still process that response before the task truly ends.
It may make sense to change ReplSetHealthPollTask to be a BackgroundJob rather than a Task, since stopping a BackgroundJob is more robust and we can wait for the job to actually finish. However, this seems like a nontrivial change and probably cannot be gracefully backported to 2.4.
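To illustrate the suspected race, here is a minimal Python sketch. It is not the server's C++ code: HeartbeatPollTask, reconfig, and the log strings are hypothetical names invented for this illustration, and the heartbeat response is simulated with an event. It only shows the ordering difference between halting a task (which stops future runs) and halting it and then waiting for it to finish.

```python
import threading


class HeartbeatPollTask:
    """Hypothetical stand-in for a heartbeat poll task (not MongoDB's class)."""

    def __init__(self, log):
        self._log = log
        self._halted = threading.Event()
        self._response = threading.Event()
        self.in_flight = threading.Event()  # set once a request is outstanding
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        # The halt flag is only checked between rounds, so a response that is
        # already awaited is still processed after halt() has been called.
        while not self._halted.is_set():
            self.in_flight.set()
            self._response.wait()
            self._response.clear()
            self._log.append("heartbeat processed")

    def deliver_response(self):
        # Simulates the heartbeat response arriving from the remote node.
        self._response.set()

    def halt(self):
        # Only prevents the task from running again.
        self._halted.set()

    def wait_until_finished(self):
        # Blocks until the task has truly ended.
        self._thread.join()


def reconfig(wait_for_finish):
    """Return the order of events for one simulated reconfig."""
    log = []
    task = HeartbeatPollTask(log)
    task.start()
    task.in_flight.wait()  # a heartbeat request is now outstanding
    task.halt()
    if wait_for_finish:
        # Assumed alternative: let the in-flight response drain first.
        task.deliver_response()
        task.wait_until_finished()
        log.append("reconfig applied")
    else:
        # Halt-only: the new config is applied while a response is pending,
        # so the stale response is processed against the new state.
        log.append("reconfig applied")
        task.deliver_response()
        task.wait_until_finished()  # cleanup only
    return log
```

In this sketch, reconfig(False) yields ["reconfig applied", "heartbeat processed"], meaning the stale response is handled after the reconfig, while reconfig(True) yields ["heartbeat processed", "reconfig applied"]. In a real system the wait would block until the response arrives or times out, which is part of why the change is nontrivial.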
- depends on
-
SERVER-15160 TopologyCoordinatorImpl should not always step down on reconfig
- Closed