[SERVER-45765] Race in ReplSetTest.initiateWithAnyNodeAsPrimary Created: 24/Jan/20  Updated: 29/Oct/23  Resolved: 25/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
is caused by SERVER-43766 Investigate the slowest sections of R... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 29

 Description   

In replsettest.js, initiateWithAnyNodeAsPrimary:

  1. Call replSetInitiate on one node with a one-node config
  2. Call getPrimary(), which initializes self._slaves
  3. Call replSetReconfig in a loop to add remaining nodes one at a time
  4. Call this.awaitSecondaryNodes(self.kDefaultTimeoutMS, self._slaves, 25 /* retryIntervalMS */);
  5. In awaitSecondaryNodes, call isMaster on each node in "slaves". Repeat until all slave nodes are secondaries/arbiters.

If there's an election any time after Step 3, then one of the members of self._slaves could now be a primary. However, awaitSecondaryNodes keeps polling the same set of nodes until it times out.
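
The failure mode can be illustrated with a small standalone model. This is only a sketch, not the real ReplSetTest code: makeNode, awaitFixedSecondaries, and the attempt limit that stands in for the timeout are all hypothetical names invented for the example.

```javascript
// Hypothetical model: each node is a plain object that reports a state string.
function makeNode(name, state) {
    return {name, state};
}

// Mimics the shape of the wait in Step 5: poll a fixed list of nodes until
// every one reports SECONDARY or ARBITER, or give up after maxAttempts polls.
function awaitFixedSecondaries(nodes, maxAttempts) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        if (nodes.every(n => n.state === "SECONDARY" || n.state === "ARBITER")) {
            return true;
        }
    }
    return false;  // stands in for the timeout
}

const n0 = makeNode("n0", "PRIMARY");
const n1 = makeNode("n1", "SECONDARY");
const n2 = makeNode("n2", "SECONDARY");

// The list of expected secondaries is captured while n0 is primary.
const expectedSecondaries = [n1, n2];

// An election happens afterwards: n1 steps up and n0 steps down.
n0.state = "SECONDARY";
n1.state = "PRIMARY";

// n1 is in the fixed list but is now primary, so the wait can never succeed.
const ok = awaitFixedSecondaries(expectedSecondaries, 100);
console.log(ok);  // false
```

Every node in the model is healthy, yet the wait fails because the list it polls was fixed before the election.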

Observed in replsettest_control_12_nodes.js. Machine overload, which causes heartbeat timeouts and elections, is probably more common now:

  1. The test starts 12 nodes, the upper limit
  2. The nodes are all started in parallel since SERVER-43772
  3. More time is spent in Step 3 above (the replSetReconfig loop) now that SERVER-45079 requires that we add one member at a time


 Comments   
Comment by A. Jesse Jiryu Davis [ 27/Jan/20 ]

The race was introduced in SERVER-43766. Before then, ReplSetTest would proceed once all nodes were primary, secondary, or arbiter. After that change, the test requires a specific set of nodes to be secondary or arbiter, hence the test times out if there's an election before it begins waiting. This change introduced the bug:

https://github.com/mongodb/mongo/commit/f5a2d477761f5d954ea63a8c8a6cfa02d124e4a7#diff-bab9f33827828bf17f93734f9a93706dR1241

Comment by Githook User [ 25/Jan/20 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: SERVER-45765 Race in initiateWithAnyNodeAsPrimary

If there's an election between the call to replSetInitiate and the last call to
replSetReconfig, the test would nevertheless expect Node 0 to be primary and
all others to be secondaries. Fix it so the test will continue as soon as all
nodes are primary, secondary, or arbiter.

Also add resource_intensive tag to a test that starts 12 nodes.
Branch: master
https://github.com/mongodb/mongo/commit/38ec223f7478be14fc3bf082643a1109efaeb57c
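
The condition described in the commit message can be sketched as follows. This is a minimal standalone model, not the code from the commit: READY_STATES, awaitAllNodesReady, the onPoll hook, and the STARTUP2 placeholder state are assumptions made for the example.

```javascript
// Hypothetical set of states in which a node counts as ready.
const READY_STATES = new Set(["PRIMARY", "SECONDARY", "ARBITER"]);

// Poll every node, rather than a precomputed list of expected secondaries,
// until each one reports a ready state or maxAttempts polls have elapsed.
// onPoll is a test hook that lets the example change node states between polls.
function awaitAllNodesReady(nodes, maxAttempts, onPoll = () => {}) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        if (nodes.every(n => READY_STATES.has(n.state))) {
            return true;
        }
        onPoll(attempt);
    }
    return false;  // stands in for the timeout
}

const nodes = [
    {name: "n0", state: "SECONDARY"},  // stepped down after an election
    {name: "n1", state: "PRIMARY"},    // won the election
    {name: "n2", state: "STARTUP2"},   // not ready yet
];

// n2 becomes a secondary after a few polls.
const readyAfterElection = awaitAllNodesReady(nodes, 100, attempt => {
    if (attempt === 2) {
        nodes[2].state = "SECONDARY";
    }
});
console.log(readyAfterElection);  // true
```

Because the check does not care which node is primary, an election during setup no longer causes a timeout.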

Generated at Thu Feb 08 05:09:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.