[SERVER-44214] Give replica set secondaries votes in concurrency suites
Created: 24/Oct/19  Updated: 29/Oct/23  Resolved: 12/Nov/19

Status: Closed
Project: Core Server
Component/s: Concurrency, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Improvement
Priority: Major - P3
Reporter: Maria van Keulen
Assignee: Judah Schvimer
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-30642 Raise election timeouts as a way to p... Closed
is related to SERVER-31670 Change replica set fixture used by re... Closed
is related to SERVER-32688 FSM replication suites should give se... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-11-18
Participants:

 Description   

Presently, the ReplicaSetFixture used by the concurrency suites gives secondary members zero votes by default. As a result, lastCommitted lag is not reliable because non-voting members are not considered in its calculation. This means that consumers of lastCommitted lag, such as Flow Control, will not have much testing coverage in most concurrency suites.
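To illustrate why zero-vote secondaries hide the lag, here is a minimal sketch of a majority commit point calculation. It is a simplified model and not the server's implementation: optimes are plain integers, and the dictionary keys `votes` and `last_applied` as well as both function names are hypothetical.

```python
def majority_commit_point(members):
    """Return the newest optime that a majority of voting members have applied.

    Simplified model: `members` is a list of dicts with the hypothetical keys
    "votes" and "last_applied"; optimes are plain integers.
    """
    # Only voting members take part in the calculation.
    voter_optimes = sorted(
        (m["last_applied"] for m in members if m["votes"] > 0), reverse=True
    )
    majority = len(voter_optimes) // 2 + 1
    # The majority-th newest optime has been reached by a majority of voters.
    return voter_optimes[majority - 1]


def last_committed_lag(members, primary_index=0):
    """Return how far the commit point trails the primary's last applied optime."""
    primary_applied = members[primary_index]["last_applied"]
    return primary_applied - majority_commit_point(members)


def make_members(secondary_votes):
    """Build a three-node set: an up-to-date primary and two lagging secondaries."""
    return [
        {"votes": 1, "last_applied": 100},
        {"votes": secondary_votes, "last_applied": 40},
        {"votes": secondary_votes, "last_applied": 30},
    ]
```

In this model, `last_committed_lag(make_members(0))` is 0 because the primary is the only voter and trivially forms its own majority, while `last_committed_lag(make_members(1))` is 60, which is the kind of lag a consumer such as Flow Control would react to.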

In order to prevent a new primary from being elected, it is sufficient to give secondary members zero priority. IIUC, giving secondaries zero votes just prevents stepdowns of the current primary.

We should consider removing the restriction on secondary votes. If it is absolutely necessary to avoid stepdowns in certain cases, we should make sure that these cases are handled individually.
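As a rough sketch of the configuration being proposed, the helper below builds a replica set config document in which secondaries keep priority 0 (so they cannot be elected) but may or may not hold a vote. The field names `members[n].priority`, `members[n].votes`, and `settings.electionTimeoutMillis` are standard replica set configuration fields; the function name, the set name `rs0`, the host names, and the 24-hour timeout are assumptions made for illustration and are not taken from the fixture's code.

```python
# Hypothetical placeholder: "very high" election timeout of 24 hours, in milliseconds.
HIGH_ELECTION_TIMEOUT_MS = 24 * 60 * 60 * 1000


def build_replset_config(hosts, secondaries_vote=True,
                         election_timeout_ms=HIGH_ELECTION_TIMEOUT_MS):
    """Build a replica set config dict; the first host is the intended primary."""
    members = []
    for index, host in enumerate(hosts):
        is_primary = index == 0
        members.append({
            "_id": index,
            "host": host,
            # Priority 0 keeps a secondary from ever being elected primary.
            "priority": 1 if is_primary else 0,
            # Votes decide whether the member counts toward majorities.
            "votes": 1 if (is_primary or secondaries_vote) else 0,
        })
    return {
        "_id": "rs0",  # hypothetical replica set name
        "members": members,
        "settings": {"electionTimeoutMillis": election_timeout_ms},
    }


# Hypothetical host names, for illustration only.
config = build_replset_config(["localhost:20000", "localhost:20001", "localhost:20002"])
```

Passing `secondaries_vote=False` reproduces the zero-vote layout described above, which makes it easy to compare the two configurations side by side.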



 Comments   
Comment by Githook User [ 12/Nov/19 ]

Author: Judah Schvimer (judahschvimer) <judah.schvimer@10gen.com>

Message: SERVER-44214 give all nodes votes with high election timeouts in tests
Branch: master
https://github.com/mongodb/mongo/commit/c57a899b676a8b9fe35fdc2147c12602deda4274

Comment by Judah Schvimer [ 25/Oct/19 ]

I actually think the primary will only step down once it has gone a full election timeout period without hearing from a majority of the set. Thus if we set the election timeout super high, the primary would never step down due to not hearing from secondaries.
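The stepdown rule described in this comment can be sketched roughly as follows. This is a simplified model of the behavior as stated in this thread, not the server's logic; the function name and the `(votes, last_heard_ms)` tuple layout are hypothetical.

```python
def primary_should_step_down(now_ms, others, election_timeout_ms):
    """Decide whether the primary has lost contact with a voting majority.

    `others` is a list of (votes, last_heard_ms) tuples, one per member other
    than the primary. Simplified model; the tuple layout is hypothetical.
    """
    # The primary always holds one vote and can always "hear" itself.
    total_votes = 1 + sum(votes for votes, _ in others)
    reachable_votes = 1 + sum(
        votes
        for votes, last_heard_ms in others
        if now_ms - last_heard_ms <= election_timeout_ms
    )
    majority = total_votes // 2 + 1
    return reachable_votes < majority


# Two secondaries that last responded 60 seconds ago.
stale_voters = [(1, 0), (1, 0)]
stale_non_voters = [(0, 0), (0, 0)]
```

With a 10-second timeout, `primary_should_step_down(60_000, stale_voters, 10_000)` is true, which is the failure mode that votes=0 was guarding against. Both workarounds discussed in this ticket avoid it in this model: zero-vote secondaries (`stale_non_voters`) leave the primary as its own majority, and a very high timeout such as `86_400_000` keeps the stale heartbeats inside the window.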

Comment by Max Hirschhorn [ 24/Oct/19 ]

Judah Schvimer wrote: "Using a high election timeout could also prevent elections. I forget why we chose zero votes in the first place. Max Hirschhorn, do you recall? I do not think it is a good idea to allow the secondaries to step up or the primaries to step down in suites that don't expect it. We spent far too much time debugging BFs there."

The purpose of giving secondaries votes=0 was to prevent the primary from stepping down if it stops exchanging heartbeat messages with the secondaries (because it'll have lost contact with a majority of the replica set). Increasing the election timeout is useful for preventing a secondary from attempting to run for election but doesn't cover all the "slow resmoke.py logging leads to a test failure" cases we've seen.

Would we want to try and switch to a mode where we allow a node to remain primary despite not having received a response to a heartbeat message in a timely fashion?

Comment by Maria van Keulen [ 24/Oct/19 ]

Flow Control has various fuzzer suites that artificially induce aggressive throttling (see SERVER-41241). It is enabled by default in both correctness and performance testing, but in practice it does not engage often in correctness testing other than in those aforementioned suites. SERVER-43870 is an example of a situation that theoretically would have engaged Flow Control, but did not because of our ReplicaSetFixture configuration.

Comment by Judah Schvimer [ 24/Oct/19 ]

We currently raise the election timeout in Python fixtures when secondaries vote. I'm not sure why we didn't just always raise the election timeout in all fixtures.

Comment by Judah Schvimer [ 24/Oct/19 ]

Using a high election timeout could also prevent elections. I forget why we chose zero votes in the first place. max.hirschhorn, do you recall? I do not think it is a good idea to allow the secondaries to step up or the primaries to step down in suites that don't expect it. We spent far too much time debugging BFs there.

maria.vankeulen, can you please comment on how much test coverage flow control has today and how much this change would add, so we can gauge its priority (if we can come to consensus on an acceptable alternative to 0 votes)?

Generated at Thu Feb 08 05:05:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.