[SERVER-35383] Increase electionTimeoutMillis for the ContinuousStepdown hook used in stepdown suites
Created: 04/Jun/18  Updated: 29/Oct/23  Resolved: 02/Jul/18

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.6.7, 4.0.1, 4.1.1

Type: Task
Priority: Major - P3
Reporter: Jonathan Abrahams
Assignee: Max Hirschhorn
Resolution: Fixed
Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-36817 replSetFreeze command run by stepdown... Closed
Related
related to SERVER-36448 Disable election handoff in suites th... Closed
related to SERVER-36451 ContinuousStepdown with killing nodes... Closed
is related to SERVER-30642 Raise election timeouts as a way to p... Closed
Backwards Compatibility: Fully Compatible
Backport Requested: v4.0, v3.6
Sprint: TIG 2018-07-02, TIG 2018-07-16
Participants:
Linked BF Score: 27
Story Points: 2

 Description   

The electionTimeoutMillis parameter for the ContinuousStepdown hook, used in the concurrency stepdown suites, is set to 5000 (5 seconds). We should increase this, per the code review discussion captured here:

> > On 2018/05/30 22:09:12, maxh wrote:
> > > [note] As mentioned in SERVER-34666, I don't think we should shorten the
> > > election timeout as it can lead to an election happening that isn't
> > > initiated by the StepdownThread due to heartbeats being delayed. I'm okay
> > > with keeping it as-is for now because it is consistent with the replica
> > > set configuration the JavaScript version would have used; however, I'd
> > > like for there to be a follow-up SERVER ticket to change it.
> > >
> > > https://jira.mongodb.org/browse/SERVER-34666?focusedCommentId=1873407&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1873407
> >
> > For the followup ticket, do we just want to remove this value and use the
> > default, or set it to a higher timeout?
>
> I'm not sure - I'd like to get some input from Judah on it. I'm currently
> wondering if we really need to avoid setting the election timeout to 24 hours
> when all_nodes_electable=true. We're going to use the replSetStepUp command in
> the Python version of the StepdownThread to cause one of the secondaries to
> run for election anyway. If for some reason the replSetStepUp command fails,
> then the former primary will try and step back up after 10 seconds on its own
> anyway.
>
> https://github.com/mongodb/mongo/blob/r4.1.0/buildscripts/resmokelib/testing/fixtures/replicaset.py#L149-L154

If you only want elections to come from the StepdownThread, then I'd recommend
setting the election timeout to 24 hours. The replSetStepUp command should still
work, and if it fails for some reason, then no other node will try to run for
election. There's no real difference between the default 10 seconds and the
current 5 seconds except for the amount of flakiness you'd expect (not the
existence of flakiness that we're trying to remove completely).
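
As a rough illustration of the recommendation above, the following sketch builds a replica set settings document with a 24-hour election timeout. The helper name make_settings and the constant names are hypothetical and are not taken from resmoke.py; only the electionTimeoutMillis field name and the 5-second, 10-second, and 24-hour values come from the discussion.

```python
# Hypothetical constants; the values themselves come from the discussion above.
PREVIOUS_TIMEOUT_MS = 5 * 1000               # value the hook used before this change
DEFAULT_TIMEOUT_MS = 10 * 1000               # default election timeout mentioned above
DISABLED_TIMEOUT_MS = 24 * 60 * 60 * 1000    # 24 hours, long enough that no timeout-driven election fires during a test


def make_settings(election_timeout_ms=DISABLED_TIMEOUT_MS):
    """Return a replica set 'settings' sub-document (illustrative only)."""
    return {"electionTimeoutMillis": election_timeout_ms}


settings = make_settings()
print(settings)
```

With a timeout this long, the only elections expected during a test run are the ones the StepdownThread requests explicitly through replSetStepUp.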



 Comments   
Comment by Githook User [ 14/Jul/18 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-35383 Raise election timeout to 24 hours for stepdown suites.

(cherry picked from commit 99d3436094d31de348edfac9fe0e40e60b28391e)
Branch: v3.6
https://github.com/mongodb/mongo/commit/f1bcba35cefd0c5c0402e32575327a77507ac03e

Comment by Githook User [ 07/Jul/18 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-35383 Raise election timeout to 24 hours for stepdown suites.

(cherry picked from commit 99d3436094d31de348edfac9fe0e40e60b28391e)
Branch: v4.0
https://github.com/mongodb/mongo/commit/7ff53a32cff306f2361c7ca0971994768dc66f80

Comment by Githook User [ 02/Jul/18 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-35383 Raise election timeout to 24 hours for stepdown suites.
Branch: master
https://github.com/mongodb/mongo/commit/99d3436094d31de348edfac9fe0e40e60b28391e

Comment by Max Hirschhorn [ 06/Jun/18 ]

The election timeout should be raised to resmoke.py's default of 24 hours for all stepdown suites that run the StepdownThread, not just for the concurrency stepdown suites. This can likely be achieved by removing the all_nodes_electable part of this condition.
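
The condition referred to in this comment is not reproduced in the ticket, so the following is only a sketch of the kind of change described. Both function names are hypothetical, and the "before" behavior is an assumption inferred from the discussion (the 24-hour default being applied only when all_nodes_electable is false), not a copy of the resmoke.py source.

```python
TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000


def settings_before(all_nodes_electable, settings=None):
    """Assumed earlier behavior: apply the 24-hour default only when
    not every node is electable."""
    settings = dict(settings or {})
    if not all_nodes_electable and "electionTimeoutMillis" not in settings:
        settings["electionTimeoutMillis"] = TWENTY_FOUR_HOURS_MS
    return settings


def settings_after(all_nodes_electable, settings=None):
    """Assumed behavior after dropping the all_nodes_electable check:
    apply the 24-hour default unless a timeout was given explicitly."""
    settings = dict(settings or {})
    settings.setdefault("electionTimeoutMillis", TWENTY_FOUR_HOURS_MS)
    return settings


print(settings_before(all_nodes_electable=True))
print(settings_after(all_nodes_electable=True))
```

In this sketch an explicitly configured timeout still takes precedence, so a suite that deliberately wants a shorter value could keep it.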
