[SERVER-14151] Election time increases in case of frequent stepdown Created: 03/Jun/14  Updated: 07/Oct/15  Resolved: 21/Sep/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.7.0
Fix Version/s: 3.1.9

Type: Bug Priority: Critical - P2
Reporter: Davide Italiano Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux localhost.localdomain 3.14.4-200.fc20.x86_64 #1 SMP Tue May 13 13:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

To reproduce, download
https://github.com/dcci/mongo-replication-perf/blob/master/stepDown.py
and run it once with --timeout 10 and once with --timeout 30.
With --timeout 10, the average election time is about 5 times that of the --timeout 30 run.

Analysis:

Whenever the "health thread" gets new results, msgCheckNewState() is called.
If no primary is detected, this function may call electSelf().
Inside electSelf(), _yea() is eventually called
so that the server can register its vote preference.

The code in _yea() looks like this:

const time_t LeaseTime = 30;
 
[...]
 
        if( L.when + LeaseTime >= now && L.who != memberId ) {
            LOG(1) << "replSet not voting yea for " << memberId <<
                   " voted for " << L.who << ' ' << now-L.when <<
                   " secs ago" << rsLog;
            throw VoteException();
        }
 
[...]

Under some circumstances, if the stepDown period is too short, the
condition of the if becomes true, the message is logged, and a
VoteException is thrown.

The exception then propagates to the caller, where it results
in this code being executed:

[...]
        catch(VoteException& ) {
            log() << "replSet not trying to elect self as responded "
                     "yea to someone else recently" << rsLog;
        }
[...]

causing a delay in the election.
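
To make the timing concrete, here is a minimal standalone sketch of the lease condition quoted above. It is not the server code: the LastVote struct and the helper names mayVoteYea() and secondsUntilLeaseExpires() are hypothetical, introduced only for illustration. Only LeaseTime and the comparison itself are taken from the _yea() snippet.

```cpp
#include <ctime>

// Hypothetical stand-in for the "L" object in the quoted snippet.
struct LastVote {
    std::time_t when;   // time at which the last yea vote was given
    unsigned who;       // member id that received that vote
};

const std::time_t LeaseTime = 30;  // seconds, as in the quoted snippet

// Returns true if a yea vote for memberId would be allowed at time now.
// This is the negation of the condition that throws VoteException.
bool mayVoteYea(const LastVote& L, unsigned memberId, std::time_t now) {
    return !(L.when + LeaseTime >= now && L.who != memberId);
}

// Seconds that must still pass before a yea vote for memberId is allowed
// (0 if it is allowed already).
std::time_t secondsUntilLeaseExpires(const LastVote& L, unsigned memberId,
                                     std::time_t now) {
    if (mayVoteYea(L, memberId, now)) {
        return 0;
    }
    return L.when + LeaseTime - now + 1;
}
```

Under these assumptions, a node that voted for member 1 at t=100 refuses a vote for member 2 at t=110 and only grants it from t=131 onward. A stepDown 10 seconds after the previous election can therefore leave a candidate waiting roughly 20 more seconds, which is consistent with the longer average election time seen with --timeout 10.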

Changing LeaseTime to a smaller value hides the problem (or makes it disappear) but exposes some more subtle issues, such as the one reported in SERVER-14149.



 Comments   
Comment by Eric Milkie [ 21/Sep/15 ]

With the new election protocol enhancements, this problem is no longer an issue.

Comment by Eric Milkie [ 17/Jul/14 ]

We'll take a look at this after the replication refactoring is complete.
