- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- Repl 2020-09-07
- 14

catchup_takeover_one_high_priority.js will run into the following scenario when running on a slow machine:
- 3 node replset (node0 and node1 are default priority and node2 has priority 2)
- wait for node2 to be primary and isolate it (so it can't do a priority takeover)
- step up node0
- stop repl on node1 and write something on node0 so it's ahead of node1
- step up node1 (which is lagged); it'll transition to primary but can't accept writes
Here's where things get weird:
- the test expects node0 to do a catchup takeover because it's ahead
- node0 wins its dry run election and runs for a real election
- node0 increments the term, so node1 steps down
- due to slow machine issues, node0 does not send out a vote request within the election timeout
- node1 steps up again because of the default election timeout
- at this point the test's assert.soon fails because node0 isn't primary like we expect
- node0 eventually does another catchup takeover, which succeeds this time, but it's too late because the test has already failed
Since this test just needs node0 to eventually become primary, we should increase the waitForState timeout here. I would suggest 10 minutes instead of 1 minute. If this call still times out after 10 minutes, that is more indicative of a hang than of a slow machine.
We should consider making the same change for catchup_takeover_two_nodes_ahead.js, which is also brittle when dealing with slow machines.
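
To illustrate why a longer timeout is harmless here, below is a minimal standalone sketch of a polling wait. It is not the real ReplSetTest code: `waitForCondition`, `makeFakeClock`, and the 90-second timeline are all hypothetical, chosen only to mirror the scenario above, where node0 becomes primary on its second catchup takeover. The point is that a poll returns as soon as the condition holds, so the larger timeout only costs time when something is actually stuck.

```javascript
// Hypothetical stand-in for a waitForState-style poll; NOT the real ReplSetTest API.
// A fake clock is injected so the example runs instantly and deterministically.
function makeFakeClock() {
    let t = 0;
    return { now: () => t, advance: (ms) => { t += ms; } };
}

// Polls isDone() once per interval until it returns true or timeoutMs has elapsed.
function waitForCondition(isDone, timeoutMs, clock, intervalMs = 1000) {
    const start = clock.now();
    while (clock.now() - start <= timeoutMs) {
        if (isDone(clock.now())) {
            return { ok: true, elapsedMs: clock.now() - start };
        }
        clock.advance(intervalMs);
    }
    return { ok: false, elapsedMs: clock.now() - start };
}

// Assumed timeline for illustration only: on a slow machine, node0 wins its
// second catchup takeover 90 seconds after the wait begins.
const node0IsPrimary = (nowMs) => nowMs >= 90 * 1000;

// 1-minute timeout: gives up before node0 becomes primary.
const oneMinute = waitForCondition(node0IsPrimary, 60 * 1000, makeFakeClock());

// 10-minute timeout: succeeds, and returns at 90 seconds rather than waiting 10 minutes.
const tenMinutes = waitForCondition(node0IsPrimary, 10 * 60 * 1000, makeFakeClock());

console.log(oneMinute);
console.log(tenMinutes);
```

With the assumed timeline, the 1-minute wait reports failure while the 10-minute wait succeeds after 90 simulated seconds, so a healthy run is not slowed down by the larger limit.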