[SERVER-44214] Give replica set secondaries votes in concurrency suites Created: 24/Oct/19 Updated: 29/Oct/23 Resolved: 12/Nov/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Maria van Keulen | Assignee: | Judah Schvimer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Sprint: | Repl 2019-11-18 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
Presently, by default we give zero votes to secondary members in concurrency suites' ReplicaSetFixture. As a result, lastCommitted lag is not reliable because non-voting members are not considered in its calculation. This means that consumers of lastCommitted lag, such as Flow Control, will not have much testing coverage in most concurrency suites. In order to prevent a new primary from being elected, it is sufficient to give secondary members zero priority. IIUC, giving secondaries zero votes just prevents stepdowns of the current primary. We should consider removing the restriction on secondary votes. If it is absolutely necessary to avoid stepdowns in certain cases, we should make sure that these cases are handled individually. |
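
A minimal sketch of the configuration shape being proposed, for illustration only. It assumes the standard replica set configuration fields `members[n].priority`, `members[n].votes`, and `settings.electionTimeoutMillis`. The helper name `build_replset_config`, the host strings, and the default timeout value are hypothetical and are not taken from the fixture's actual code.

```python
from typing import Any, Dict, List


def build_replset_config(
    name: str,
    hosts: List[str],
    election_timeout_millis: int = 24 * 60 * 60 * 1000,  # arbitrary large value (assumption)
) -> Dict[str, Any]:
    """Build a replica set config document (hypothetical helper, not the fixture's code)."""
    members = []
    for i, host in enumerate(hosts):
        member: Dict[str, Any] = {"_id": i, "host": host}
        if i > 0:
            # priority 0: this member can never be elected primary.
            member["priority"] = 0
            # votes 1: this member still counts as a voting member, which the
            # description says is required for it to be considered in lastCommitted lag.
            member["votes"] = 1
        members.append(member)
    return {
        "_id": name,
        "members": members,
        "settings": {"electionTimeoutMillis": election_timeout_millis},
    }
```

With this shape, the first host is the only member eligible to become primary, while every member keeps its vote.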
| Comments |
| Comment by Githook User [ 12/Nov/19 ] |
|
Author: {'name': 'Judah Schvimer', 'username': 'judahschvimer', 'email': 'judah.schvimer@10gen.com'} Message: |
| Comment by Judah Schvimer [ 25/Oct/19 ] |
|
I actually think the primary will not step down until it hasn't heard from a majority of the set in the election timeout period. Thus if we set the election timeout super high, the primary would never step down due to not hearing from secondaries. |
| Comment by Max Hirschhorn [ 24/Oct/19 ] |
The purpose of giving secondaries votes=0 was to prevent the primary from stepping down if it stops exchanging heartbeat messages with the secondaries (because it'll have lost contact with a majority of the replica set). Increasing the election timeout is useful for preventing a secondary from attempting to run for election but doesn't cover all the "slow resmoke.py logging leads to a test failure" cases we've seen. Would we want to try and switch to a mode where we allow a node to remain primary despite not having received a response to a heartbeat message in a timely fashion? |
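
The trade-off in this thread can be illustrated with a toy model of the rule the comments describe: a primary stays primary only while it has heard from a majority of voting members within the election timeout. This is a simplified sketch, not the server's actual logic. The function name `primary_sees_majority`, the member dictionaries, and the timestamps are all hypothetical.

```python
from typing import Any, Dict, List


def primary_sees_majority(
    members: List[Dict[str, Any]],
    primary_host: str,
    last_heard_ms: Dict[str, int],
    now_ms: int,
    election_timeout_ms: int,
) -> bool:
    """Return True if the primary has recent contact with a majority of voters (toy model)."""
    voters = [m for m in members if m["votes"] > 0]
    majority = len(voters) // 2 + 1
    visible = 0
    for m in voters:
        host = m["host"]
        if host == primary_host:
            # The primary always counts itself.
            visible += 1
        elif host in last_heard_ms and now_ms - last_heard_ms[host] <= election_timeout_ms:
            # A voter counts only if its last heartbeat response is recent enough.
            visible += 1
    return visible >= majority
```

In this model, when the secondaries have votes=0 the primary is the only voter and always satisfies the majority by itself, which matches the stated purpose of the original setting. When the secondaries vote, stale heartbeats cause the check to fail unless the election timeout is large enough to cover the delay.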
| Comment by Maria van Keulen [ 24/Oct/19 ] |
|
Flow Control has various fuzzer suites that artificially induce aggressive throttling (see |
| Comment by Judah Schvimer [ 24/Oct/19 ] |
|
We currently raise the election timeout in python fixtures when secondaries vote. I'm not sure why we didn't just use the election timeout always in all fixtures. |
| Comment by Judah Schvimer [ 24/Oct/19 ] |
|
Using a high election timeout could also prevent elections. I forget why we chose zero votes in the first place. max.hirschhorn, do you recall? I do not think it is a good idea to allow the secondaries to step up or the primaries to step down in suites that don't expect it. We spent far too much time debugging BFs there. maria.vankeulen, can you please comment on the amount of test coverage that flow control has and how much this will add to gauge its priority (if we can come to consensus on an acceptable alternative to 0 votes)? |