-
Type: Bug
-
Resolution: Done
-
Priority: Critical - P2
-
Affects Version/s: 2.7.0
-
Component/s: Replication
-
Environment: Linux localhost.localdomain 3.14.4-200.fc20.x86_64 #1 SMP Tue May 13 13:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
-
Fully Compatible
-
ALL
To reproduce, download
https://github.com/dcci/mongo-replication-perf/blob/master/stepDown.py
and run it once with --timeout 10 and once with --timeout 30.
With --timeout 10, the average election time is about 5 times that observed with --timeout 30.
Analysis:
Whenever the "health thread" gets new results, msgCheckNewState() is called.
If no primary is detected, this function may call electSelf().
At some point electSelf() calls _yea() so that the server can
register its vote preference.
The relevant code in _yea() looks like this:
const time_t LeaseTime = 30;
[...]
if( L.when + LeaseTime >= now && L.who != memberId ) {
    LOG(1) << "replSet not voting yea for " << memberId
           << " voted for " << L.who << ' ' << now-L.when << " secs ago" << rsLog;
    throw VoteException();
}
[...]
Under some circumstances, if the stepDown period is too short, the
condition of the if becomes true, the message is logged, and
VoteException() is thrown.
The exception then propagates to the caller, which results
in this code being executed:
[...]
catch(VoteException& ) {
    log() << "replSet not trying to elect self as responded yea to someone else recently" << rsLog;
}
[...]
causing a delay in the election.
Changing LeaseTime to a smaller value makes the problem disappear, but exposes more subtle issues such as the one reported in SERVER-14149.
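The lease check described in the analysis can be modelled in isolation. The following is a minimal sketch, not the server code: the LastYea struct, the free function yea(), and the helper canElectSelf() are hypothetical stand-ins introduced for illustration, and only the condition itself and the LeaseTime value are taken from the snippet quoted above.

```cpp
#include <ctime>
#include <stdexcept>

// Hypothetical stand-in for the server's exception type.
struct VoteException : std::runtime_error {
    VoteException() : std::runtime_error("yea given to another member recently") {}
};

// Hypothetical record of the last yea vote this node gave.
struct LastYea {
    std::time_t when;  // time of the last yea vote, in seconds
    unsigned who;      // id of the member that received it
};

const std::time_t LeaseTime = 30;  // value from the quoted snippet

// Same condition as the quoted _yea() code: refuse to vote for memberId
// while a yea given to a different member is at most LeaseTime seconds old.
void yea(LastYea& L, unsigned memberId, std::time_t now) {
    if (L.when + LeaseTime >= now && L.who != memberId) {
        throw VoteException();
    }
    L.when = now;
    L.who = memberId;
}

// Assumed shape of the caller: the exception is swallowed and the
// attempt to elect self is abandoned for this round.
bool canElectSelf(LastYea L, unsigned self, std::time_t now) {
    try {
        yea(L, self, now);
        return true;
    } catch (const VoteException&) {
        return false;
    }
}
```

With a last yea given to member 1 at time 0, canElectSelf(lease, 2, now) returns false for now = 10 and now = 30, and true for now = 31. In this model a node that voted for another member 10 seconds earlier is still inside the 30-second lease, which is consistent with the slower elections reported for --timeout 10, assuming the stepDown period roughly determines how soon the next election is attempted.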