[SERVER-30642] Raise election timeouts as a way to provide more stable replica set test topologies Created: 14/Aug/17  Updated: 30/Oct/23  Resolved: 26/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.5, 3.7.3

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Jonathan Abrahams
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-32688 FSM replication suites should give se... Closed
depends on SERVER-32794 Make timeouts unrelated to elections ... Closed
Related
related to SERVER-38749 Concurrent stepdown suites on 3.6 bra... Closed
related to SERVER-35383 Increase electionTimeoutMillis for th... Closed
related to SERVER-44214 Give replica set secondaries votes in... Closed
is related to SERVER-32691 Create passthrough for w="majority" w... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6
Sprint: Repl 2018-01-29, Repl 2018-02-12, TIG 2018-03-12
Participants:
Linked BF Score: 15

 Description   

For JavaScript tests that aren't trying to directly test any aspect of the consensus machinery, we should consider making unwanted elections impossible, so that spurious topology changes do not interfere with the actions a test is executing. Raising election timeouts to some very high value could be one solution. It would make tests more resilient to machine/network slowness and improve their stability. Setting the priority of secondary nodes to 0 (in addition to high election timeouts) could also help reduce unexpected elections.
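As a rough illustration of the two ideas above, the following sketch builds a replica set configuration document with a very high `settings.electionTimeoutMillis` and priority 0 on every member except the intended primary. The helper name `stabilize_replset_config`, the 24-hour timeout value, the set name, and the host names are assumptions made for this example; they are not taken from the commits linked on this ticket.

```python
import copy

# Assumed value for illustration: 24 hours in milliseconds. The ticket only
# asks for "some very high value"; it does not specify a number.
HIGH_ELECTION_TIMEOUT_MILLIS = 24 * 60 * 60 * 1000


def stabilize_replset_config(config, primary_index=0):
    """Return a copy of a replica set config with unwanted elections made unlikely.

    Hypothetical helper: raises settings.electionTimeoutMillis and sets
    priority 0 on every member other than the intended primary.
    """
    stable = copy.deepcopy(config)  # leave the caller's document untouched
    settings = stable.setdefault("settings", {})
    settings["electionTimeoutMillis"] = HIGH_ELECTION_TIMEOUT_MILLIS
    for index, member in enumerate(stable["members"]):
        if index != primary_index:
            member["priority"] = 0  # priority-0 members never seek election
    return stable


# Example three-node config; the set name and hosts are placeholders.
example_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "localhost:20000"},
        {"_id": 1, "host": "localhost:20001"},
        {"_id": 2, "host": "localhost:20002"},
    ],
}

stable_config = stabilize_replset_config(example_config)
```

A document like `stable_config` could then be passed to `replSetInitiate` or `replSetReconfig` by whatever fixture starts the set. Tests that deliberately exercise elections would keep the default settings.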

We may want to consider reviewing tests to see which ones we consider "consensus agnostic" and which we do not.



 Comments   
Comment by Githook User [ 19/Apr/18 ]

Author:

{'email': 'jonathan@mongodb.com', 'username': 'hptabster', 'name': 'Jonathan Abrahams'}

Message: SERVER-30642 Raise election timeouts, in the concurrency tests, as a way to provide more stable replica set test topologies

(cherry picked from commit 3aa315557bef775c5291068e365a59a3a810fc41)
Branch: v3.6
https://github.com/mongodb/mongo/commit/bfc56f2c54ba83a4041213b82b187e36cf1a9af1

Comment by Githook User [ 19/Apr/18 ]

Author:

{'email': 'judah@mongodb.com', 'username': 'judahschvimer', 'name': 'Judah Schvimer'}

Message: SERVER-30642 Raise election timeouts in python fixtures

(cherry picked from commit 6a1e6fe87e7d510d2e795263520e918c9033e044)
Branch: v3.6
https://github.com/mongodb/mongo/commit/4b0ccd4520b1e038f32088a6934a1587a7316743

Comment by Githook User [ 26/Feb/18 ]

Author:

{'email': 'jonathan@mongodb.com', 'name': 'Jonathan Abrahams', 'username': 'hptabster'}

Message: SERVER-30642 Raise election timeouts, in the concurrency tests, as a way to provide more stable replica set test topologies
Branch: master
https://github.com/mongodb/mongo/commit/3aa315557bef775c5291068e365a59a3a810fc41

Comment by Judah Schvimer [ 09/Feb/18 ]

I anticipated that the scope of this ticket would be limited to the FSM suites that 0 votes doesn't fix, along the same lines as the Python fixtures. I do not think this should affect any tests in the replsets or sharding directories.

Comment by Max Hirschhorn [ 09/Feb/18 ]

We may want to consider reviewing tests to see which ones we consider "consensus agnostic" and which we do not.

judah.schvimer, are you anticipating that the TIG team would do this audit, or what is your expectation of the code changes being made here?

Comment by Judah Schvimer [ 09/Feb/18 ]

Sending to TIG to finish after SERVER-32688.

Comment by Githook User [ 23/Jan/18 ]

Author:

{'name': 'Judah Schvimer', 'email': 'judah@mongodb.com', 'username': 'judahschvimer'}

Message: SERVER-30642 Raise election timeouts in python fixtures
Branch: master
https://github.com/mongodb/mongo/commit/6a1e6fe87e7d510d2e795263520e918c9033e044

Comment by William Schultz (Inactive) [ 14/Aug/17 ]

There's no reason to believe this test is having "more spurious failovers" than others, but it is one example of a test with such an issue, so I figured it could be a starting point for this kind of fix, since it has definitively caused a number of build failures. But yes, arguably, for tests that only require a stable replica set topology, i.e. tests that aren't trying to exercise elections, I think that something like maximizing the election timeout could be a good way to make them more stable and resilient to slow hardware, network issues, etc. Reviewing all tests to see if they fall into this category, however, would likely be a larger task.

Comment by Spencer Brody (Inactive) [ 14/Aug/17 ]

Why is this test having more spurious failovers than other tests? Is there a reason we should do this for this test but not all other tests that don't explicitly test election timeouts?

Generated at Thu Feb 08 04:24:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.