  1. Core Server
  2. SERVER-11564

replset reconfigs sometimes trigger elections in cases where they shouldn't

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 2.4.6, 2.5.3
    • Component/s: Replication

      Create a 5-node replica set with nodes as follows:
      1) priority: 3
      2) priority: 2
      3) priority: 0, hidden: true
      4) arbiter
      5) arbiter

      Perform a reconfig that increases the priority of node 1.
      This triggers an unnecessary election much of the time.
      If it doesn't, change the priority back, which may trigger one.
      If that doesn't either, change it again.
      Repeat as needed; it shouldn't take many tries (it happened about 4 out of 5 times for me).
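
      The reproduction steps can be sketched as plain config documents. This is a minimal illustration, not a tested repro script: the set name "rs0" and the host names are made up, and the new priority value 4 is only an example. The field names (version, members, priority, hidden, arbiterOnly) are the standard replica set configuration fields.

```python
import copy


def initial_config():
    """Five-member config matching the steps above (set name and hosts are hypothetical)."""
    return {
        "_id": "rs0",
        "version": 1,
        "members": [
            {"_id": 0, "host": "node1.example:27017", "priority": 3},
            {"_id": 1, "host": "node2.example:27017", "priority": 2},
            {"_id": 2, "host": "node3.example:27017", "priority": 0, "hidden": True},
            {"_id": 3, "host": "node4.example:27017", "arbiterOnly": True},
            {"_id": 4, "host": "node5.example:27017", "arbiterOnly": True},
        ],
    }


def with_priority(config, member_id, priority):
    """Return a copy of config with one member's priority changed and the version bumped."""
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["_id"] == member_id:
            member["priority"] = priority
            break
    else:
        raise ValueError(f"no member with _id {member_id}")
    new_config["version"] += 1
    return new_config


cfg = initial_config()
raised = with_priority(cfg, 0, 4)         # step 1: increase node 1's priority
restored = with_priority(raised, 0, 3)    # step 2: change it back if no election occurred
```

      With a driver, each of these documents would be sent to the primary as a replSetReconfig command; the sketch stops short of that so it can run without a server.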


      During a reconfig, the primary relinquishes its role and then recovers it almost immediately. This is normal and fine.

      Sometimes, however, an election occurs and it takes 6-8 seconds before the node resumes being primary. In these problem cases, the primary usually relinquishes with the message "[rsMgr] can't see a majority of the set, relinquishing primary." This comes from msgCheckNewState, which is run when we receive a heartbeat response.
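
      To make the "can't see a majority" condition concrete, here is a simplified sketch of a strict-majority check. It is not the server's msgCheckNewState logic; the function names and the health map are hypothetical, and it assumes every member carries exactly one vote.

```python
def visible_votes(health):
    """Count members believed to be up; the checking node always counts itself."""
    return 1 + sum(1 for up in health.values() if up)


def sees_majority(total_votes, visible):
    """True when strictly more than half of the set's votes are visible."""
    return visible * 2 > total_votes


# Hypothetical view the primary holds of its four peers in the 5-node set.
health = {"node2": True, "node3": True, "node4": True, "node5": True}
healthy_view = sees_majority(5, visible_votes(health))

# If the view is incomplete for any reason, the same check fails
# and a primary applying it would step down.
partial = {"node2": True, "node3": False, "node4": False, "node5": False}
partial_view = sees_majority(5, visible_votes(partial))
```

      In the 5-node set from the reproduction steps, the primary needs to see at least 3 votes including its own, so a view with only one reachable peer is enough to fail the check.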

      I believe this happens due to a race in shutting down our ReplSetHealthPollTasks: halting a task only prevents it from running again. This means that if we are waiting on a heartbeat response when we halt the task, we will still process that response before the task truly ends.
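
      The suspected race can be illustrated with a toy model. The PollTask class below is a hypothetical stand-in, not the server's ReplSetHealthPollTask; it only mirrors the behavior described above, where halting blocks future rounds but leaves an outstanding request alone.

```python
class PollTask:
    """Toy heartbeat poller (hypothetical; not the server's implementation)."""

    def __init__(self, log):
        self.log = log
        self.halted = False
        self.in_flight = False

    def send_heartbeat(self):
        # A halted task never starts a new round.
        if self.halted:
            return False
        self.in_flight = True
        return True

    def halt(self):
        # Only blocks future rounds; the outstanding request is untouched.
        self.halted = True

    def on_response(self, response):
        # The handler does not consult `halted`, so a late reply is still processed.
        if self.in_flight:
            self.in_flight = False
            self.log.append(response)


log = []
task = PollTask(log)
task.send_heartbeat()                    # request goes out
task.halt()                              # task is halted while the reply is pending
task.on_response("late heartbeat reply") # reply arrives and is processed anyway
restarted = task.send_heartbeat()        # no further rounds start
```

      After this sequence the log contains the late reply even though the task was halted first, which is the window the paragraph above describes.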

      It may make sense to change the ReplSetHealthPollTask to be a BackgroundJob rather than a Task, as stopping a BackgroundJob is more robust and we can wait for the job to actually finish. But this seems like a nontrivial change and probably cannot be gracefully backported to 2.4.
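
      As a rough analogy for "wait for the job to actually finish", the sketch below uses a Python thread whose stop() joins the worker before returning. The PollJob class is hypothetical and says nothing about how BackgroundJob is implemented; it only shows the stop-then-wait pattern.

```python
import threading


class PollJob:
    """Toy background job whose stop() blocks until the worker has exited."""

    def __init__(self):
        self.processed = []
        self._stop = threading.Event()
        self._wake = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        # Block until a reply arrives or stop() wakes us.
        self._wake.wait()
        # Check the stop flag before acting on anything.
        if not self._stop.is_set():
            self.processed.append("reply")

    def stop(self):
        self._stop.set()      # set the flag first so the worker sees it on wake-up
        self._wake.set()      # wake the worker so it can exit
        self._thread.join()   # return only once the worker is really done

    def is_running(self):
        return self._thread.is_alive()


job = PollJob()
job.start()
job.stop()
still_running = job.is_running()
```

      Because stop() joins the thread, no work from the old job can run after stop() returns, which is the guarantee a halted task lacks.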

            Assignee:
            spencer@mongodb.com Spencer Brody (Inactive)
            Reporter:
            matt.dannenberg Matt Dannenberg
            Votes:
            1
            Watchers:
            9
