  1. Core Server
  2. SERVER-14151

Election time increases in case of frequent stepdown

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 3.1.9
    • Component/s: Replication
    • Labels:
    • Environment:
      Linux localhost.localdomain 3.14.4-200.fc20.x86_64 #1 SMP Tue May 13 13:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL

      Description

      To reproduce, get
      https://github.com/dcci/mongo-replication-perf/blob/master/stepDown.py
      and run it once with --timeout 10 and once with --timeout 30.
      With --timeout 10, the average election time is about 5 times that of the --timeout 30 run.

      Analysis:

      Whenever the "health thread" gets new results, msgCheckNewState() is called.
      If no primary is detected, this function may call electSelf().
      Inside electSelf(), at some point, _yea() is called so that the
      server can register its vote preference.

      The code in _yea() looks like this:

      const time_t LeaseTime = 30;
       
      [...]
       
              if( L.when + LeaseTime >= now && L.who != memberId ) {
                  LOG(1) << "replSet not voting yea for " << memberId <<
                         " voted for " << L.who << ' ' << now-L.when << " secs ago" << rsLog;
                  throw VoteException();
              }
       
      [...]
      

      Under some circumstances, if the stepDown period is too short relative
      to LeaseTime, the condition of the if becomes true: the message is
      logged and VoteException() is thrown.
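
      To make the condition easier to follow, here is a minimal standalone
      sketch of the lease check. It is not the server code: LastVote, the
      VoteException definition, tryVote() and the explicit `now` parameter are
      hypothetical stand-ins, and the assumption that a granted vote updates
      `when` and `who` is mine. Only LeaseTime and the if condition are taken
      from the snippet above.

```cpp
#include <ctime>
#include <stdexcept>

// Same value as in the snippet quoted above.
const std::time_t LeaseTime = 30;

// Hypothetical stand-in for the server's exception type.
struct VoteException : std::runtime_error {
    VoteException() : std::runtime_error("voted for someone else recently") {}
};

// Hypothetical record of the last vote; only the field names `when` and
// `who` come from the original condition.
struct LastVote {
    std::time_t when = 0;        // time of the last granted vote
    unsigned who = 0xffffffffu;  // member that received it (sentinel: nobody)
};

// Simplified model of the check: refuse to vote for memberId if a different
// member received our vote within the last LeaseTime seconds.
// Assumption: a granted vote is recorded by updating `when` and `who`.
int yea(LastVote& L, unsigned memberId, std::time_t now) {
    if (L.when + LeaseTime >= now && L.who != memberId) {
        throw VoteException();
    }
    L.when = now;
    L.who = memberId;
    return 1;  // hypothetical vote weight
}

// Returns true if the vote was granted, false if the lease blocked it.
bool tryVote(LastVote& L, unsigned memberId, std::time_t now) {
    try {
        yea(L, memberId, now);
        return true;
    } catch (const VoteException&) {
        return false;
    }
}
```

      In this model, a node that voted for member 1 at t=1000 rejects a vote
      for member 2 at t=1010, while a repeated vote for member 1 at t=1010 is
      still accepted and renews the lease.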

      The exception is then propagated to the caller, which results in
      this code being executed:

      [...]
              catch(VoteException& ) {
                  log() << "replSet not trying to elect self as responded yea to someone else recently" << rsLog;
              }
      [...]
      

      causing a delay in the election.
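
      As a rough illustration of where the delay comes from, the sketch below
      computes how long a node stays blocked by the lease before it can vote
      for itself. The helper names (leaseBlocks, secondsUntilSelfVote) and the
      LastVote struct are hypothetical, and the model assumes that the node
      last voted for the outgoing primary and simply retries once per second,
      which the real server does not necessarily do.

```cpp
#include <ctime>

// Same value as in the snippet quoted above.
const std::time_t LeaseTime = 30;

// Hypothetical record of the last vote this node granted.
struct LastVote {
    std::time_t when;  // time of the last granted vote
    unsigned who;      // member that received it
};

// True when the lease condition from _yea() would reject a vote for memberId.
bool leaseBlocks(const LastVote& L, unsigned memberId, std::time_t now) {
    return L.when + LeaseTime >= now && L.who != memberId;
}

// Whole seconds a node has to wait, starting at `now`, before the lease
// condition lets it vote for itself (selfId).
std::time_t secondsUntilSelfVote(const LastVote& L, unsigned selfId,
                                 std::time_t now) {
    std::time_t waited = 0;
    while (leaseBlocks(L, selfId, now + waited)) {
        ++waited;
    }
    return waited;
}
```

      Under these assumptions, a node that voted for member 1 at t=1000 and
      sees it step down 10 seconds later is blocked for 21 more seconds,
      whereas after a 30 second stepDown period it is blocked for only 1
      second.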

      Changing LeaseTime to a smaller value makes the problem disappear (or at least hides it), but exposes some more subtle issues, such as the one reported in SERVER-14149.

            People

            Assignee:
            milkie Eric Milkie
            Reporter:
            davide.italiano Davide Italiano