[SERVER-32767] Add retries to ReplSetTest._callIsMaster on error Created: 18/Jan/18  Updated: 01/May/18  Resolved: 26/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Tess Avitabile (Inactive)
Resolution: Won't Fix Votes: 0
Labels: rbfz
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-33287 Create passthrough that kills the pri... Closed
Related
related to SERVER-34526 Remove the livenodes list from repl t... Closed
Operating System: ALL
Sprint: Repl 2018-05-07
Participants:

 Description   

If _callIsMaster finds the master node, sets it as self.liveNodes.master, and then gets an error calling isMaster on a secondary, that secondary is not added to the liveNodes list, yet _callIsMaster still returns the master. Other functions rely on the liveNodes list being accurate.
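A minimal sketch of the retry behavior the ticket title proposes, using the mongo shell helpers (assert.commandWorked, sleep, print); tryIsMasterWithRetries is a hypothetical helper name, not the actual ReplSetTest implementation:

    function tryIsMasterWithRetries(node, numRetries) {
        // Retry isMaster a few times before treating the node as down, so a
        // transient network blip does not leave the node out of liveNodes.
        for (let attempt = 0; attempt <= numRetries; attempt++) {
            try {
                return assert.commandWorked(
                    node.getDB("admin").runCommand({isMaster: 1}));
            } catch (e) {
                print("isMaster failed on " + node.host + ": " + e);
                sleep(1000);  // brief pause before the next attempt
            }
        }
        return null;  // only after all retries fail is the node considered dead
    }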



 Comments   
Comment by Tess Avitabile (Inactive) [ 26/Apr/18 ]

Closing in favor of SERVER-34526.

Comment by Daniel Gottlieb (Inactive) [ 17/Apr/18 ]

I hit a test failure caused by "_liveNodes" being an implementation detail that can change on a whim between calls to awaitReplication and checkOplogs. I think Matthew's suggestion is a more robust fix, giving tests control over which nodes they expect to be running and which are valid targets for data consistency checks.

That being said, my case would have been fixed with a small sleep + retry. Given the status of this ticket hasn't budged, I would accept expediting this simple fix.

Comment by Max Hirschhorn [ 15/Apr/18 ]

I'm requesting that the Replication team re-triage this ticket. I suspect we'll need some way to wait for a node to finish rolling back (if it was going to) as part of the data consistency checks in order to avoid intermittent failures in some of the new test suites we'll be adding around recoverable rollback.
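A sketch of the kind of wait described above, assuming the shell's replSetGetStatus output and the ReplSetTest.State constants; awaitRollbackFinished is a hypothetical helper name:

    function awaitRollbackFinished(node) {
        // Poll replSetGetStatus until the node reports a state other than
        // ROLLBACK, so data consistency checks do not run mid-rollback.
        assert.soonNoExcept(() => {
            const status = assert.commandWorked(
                node.adminCommand({replSetGetStatus: 1}));
            const self = status.members.find((member) => member.self);
            return self.state !== ReplSetTest.State.ROLLBACK;
        }, "node " + node.host + " did not finish rolling back");
    }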

Comment by Judah Schvimer [ 12/Feb/18 ]

I agree, but one failed isMaster does not mean the node is dead; it could just be a network blip. This ticket suggests retries as an attempt to confirm the node is actually dead before excluding it from liveNodes.

Comment by Matthew Russotto [ 09/Feb/18 ]

I think the liveNodes list is accurate; a node that returns a failure from isMaster is by definition not live. Possibly awaitReplication should use self.nodes instead of self.liveNodes; this would work unless we use awaitReplication in cases where a node is down. Otherwise, the hook should call one of the "awaitNodesAgreeOn" functions to ensure all nodes are up.
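A sketch of that alternative, assuming the existing ReplSetTest helpers (awaitNodesAgreeOnPrimary, awaitReplication, checkOplogs); the exact call sequence the hook would use is illustrative:

    // Before relying on liveNodes, have the hook confirm every node is up and
    // agrees on the primary, then run the usual consistency checks.
    rst.awaitNodesAgreeOnPrimary();
    rst.awaitReplication();
    rst.checkOplogs();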

Comment by Judah Schvimer [ 18/Jan/18 ]

It also seems likely that something is wrong with the calling test if awaitReplication ever finds zero secondaries. It may be worth failing outright when that happens.
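A sketch of that defensive check, assuming a liveNodes.slaves list parallel to the liveNodes.master field mentioned in the description; its placement inside awaitReplication is hypothetical:

    // Fail fast instead of silently awaiting replication on zero secondaries.
    assert.gt(self.liveNodes.slaves.length, 0,
              "awaitReplication found no live secondaries; the calling test is likely broken");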
