[SERVER-50623] make catchup_takeover_one_high_priority.js more robust to slow machines Created: 28/Aug/20  Updated: 29/Oct/23  Resolved: 04/Sep/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Pavithra Vetriselvan
Assignee: Pavithra Vetriselvan
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Sprint: Repl 2020-09-07
Participants:
Linked BF Score: 14

 Description   

catchup_takeover_one_high_priority.js runs into the following scenario on a slow machine:

  • 3-node replica set (node0 and node1 have default priority and node2 has priority 2)
  • wait for node2 to be primary and isolate it (so it can't do a priority takeover)
  • step up node0
  • stop repl on node1 and write something on node0 so it's ahead of node1
  • step up node1 (which is lagged); it transitions to primary but can't accept writes

Here's where things get weird:

  • the test expects node0 to do a catchup takeover because it's ahead
  • node0 wins its dry run election and runs for a real election
  • node0 increments the term, so node1 steps down
  • due to slow machine issues, node0 does not send out a vote request within the election timeout
  • node1 steps up again because the default election timeout expires
  • at this point the test's assert.soon fails because node0 isn't primary as we expect
  • node0 eventually attempts another catchup takeover and succeeds this time, but it's too late because the test has already failed

Since this test just needs node0 to eventually become primary, we should increase the waitForState timeout here. I would suggest 10 minutes instead of 1 minute. If this call times out after 10 minutes, it would be more indicative of a hang than of a slow-machine issue.
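
To illustrate why the longer timeout helps, here is a rough, self-contained sketch. It is not the code from the test: waitForStateSim and node0StateAt are hypothetical helpers, the clock is simulated so the example runs instantly, and the 90-second figure for the second takeover attempt is an assumed number chosen only for illustration.

```javascript
// Hypothetical stand-in for a polling wait (in the spirit of assert.soon).
// It advances a simulated clock instead of sleeping.
function waitForStateSim(stateAt, wanted, timeoutMs, intervalMs = 200) {
    for (let now = 0; now <= timeoutMs; now += intervalMs) {
        if (stateAt(now) === wanted) {
            return now;  // simulated time at which the state was first observed
        }
    }
    throw new Error(`timed out after ${timeoutMs} ms waiting for ${wanted}`);
}

// Assumed timeline for node0 on a slow machine (illustrative numbers only):
// the first catchup takeover attempt stalls, and the retry succeeds at 90 seconds.
function node0StateAt(nowMs) {
    return nowMs >= 90 * 1000 ? "PRIMARY" : "SECONDARY";
}

// Run the wait with a given timeout and report whether it succeeded.
function tryWait(timeoutMs) {
    try {
        waitForStateSim(node0StateAt, "PRIMARY", timeoutMs);
        return "ok";
    } catch (e) {
        return "timeout";
    }
}

const oneMinute = tryWait(60 * 1000);        // gives up before the retry succeeds
const tenMinutes = tryWait(10 * 60 * 1000);  // leaves room for the retry
console.log(oneMinute, tenMinutes);          // prints "timeout ok"
```

Under these assumptions the 1-minute wait fails even though node0 does become primary shortly afterwards, while the 10-minute wait returns as soon as the retry succeeds, so the longer timeout only costs extra time when the node never becomes primary at all.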

We should consider making the same change in catchup_takeover_two_nodes_ahead.js, which is also brittle on slow machines.



 Comments   
Comment by Githook User [ 03/Sep/20 ]

Author:

{'name': 'Pavi Vetriselvan', 'email': 'pavithra.vetriselvan@mongodb.com', 'username': 'pvselvan'}

Message: SERVER-50623 make catchup_takeover_one_high_priority.js robust to slow machines
Branch: master
https://github.com/mongodb/mongo/commit/8db9cd370aeefacb64c035c92b1ebd8bfb9e8ce5
