[SERVER-32767] Add retries to ReplSetTest._callIsMaster on error Created: 18/Jan/18 Updated: 01/May/18 Resolved: 26/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Tess Avitabile (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | rbfz |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Sprint: | Repl 2018-05-07 |
| Participants: | |
| Description |
|
If _callIsMaster finds the master node and sets it as self.liveNodes.master, but then gets an error calling ismaster on a secondary, that secondary will not be added to the liveNodes list even though _callIsMaster still returns the master. Other functions rely on the liveNodes list being accurate. |
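A minimal sketch of the retry approach the ticket title proposes, assuming a mongo shell helper with a hypothetical name (callIsMasterWithRetries is not part of ReplSetTest); the idea is that a single transient network blip should not be enough to drop a node from liveNodes:

```javascript
// Hypothetical helper, not the actual ReplSetTest._callIsMaster implementation.
// Retries the ismaster command a few times so that one transient failure does
// not cause the node to be left out of the liveNodes list.
function callIsMasterWithRetries(conn, numRetries = 3) {
    for (let attempt = 0; attempt <= numRetries; attempt++) {
        try {
            // 'ismaster' is a real server command; commandWorked throws on failure.
            return assert.commandWorked(conn.getDB("admin").runCommand({ismaster: 1}));
        } catch (e) {
            if (attempt === numRetries) {
                throw e;  // the node really does appear to be down
            }
            sleep(1000);  // brief backoff before the next attempt
        }
    }
}
```

Only if every attempt fails would the node be treated as dead and omitted from liveNodes.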
| Comments |
| Comment by Tess Avitabile (Inactive) [ 26/Apr/18 ] |
|
Closing in favor of |
| Comment by Daniel Gottlieb (Inactive) [ 17/Apr/18 ] |
|
I hit a test failure caused by "_liveNodes" being an implementation detail that can change on a whim between calls to awaitReplication and checkOplogs. I think Matthew's suggestion is a more robust fix, since it gives tests control over which nodes they expect to be running and which are valid for data consistency checks. That being said, my case would have been fixed with a small sleep + retry. Given that the status of this ticket hasn't budged, I would accept expediting this simple fix. |
| Comment by Max Hirschhorn [ 15/Apr/18 ] |
|
I'm requesting that the Replication team re-triage this ticket. I suspect we'll need some way to wait for a node to finish rolling back (if it was going to) as part of the data consistency checks in order to avoid intermittent failures in some of the new test suites we'll be adding around recoverable rollback. |
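A rough sketch of the kind of wait Max describes, assuming the consistency hook holds a connection to the primary in a variable named primary (an illustrative name); it polls replSetGetStatus until no member is still in the ROLLBACK state before running the checks:

```javascript
// Illustrative only: wait for every member to leave the ROLLBACK state before
// running data consistency checks. replSetGetStatus and stateStr are real
// fields; 'primary' is an assumed connection variable.
assert.soon(() => {
    const status = assert.commandWorked(primary.adminCommand({replSetGetStatus: 1}));
    return status.members.every(member => member.stateStr !== "ROLLBACK");
}, "timed out waiting for all nodes to finish rolling back");
```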
| Comment by Judah Schvimer [ 12/Feb/18 ] |
|
I agree, but one failed isMaster does not mean the node is dead; it could just be a network blip. This ticket suggests retries as an attempt to ensure the node is actually dead before excluding it. |
| Comment by Matthew Russotto [ 09/Feb/18 ] |
|
I think the liveNodes list is accurate; a node that returns a failure from isMaster is by definition not live. Possibly awaitReplication should use self.nodes instead of self.liveNodes; this would work unless we use awaitReplication in cases where a node is down. Otherwise, the hook should call one of the "awaitNodesAgreeOn" functions to ensure all nodes are up. |
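A short sketch of the alternative Matthew mentions, assuming a hook holds the ReplSetTest object in a variable named rst; awaitNodesAgreeOnPrimary is an existing ReplSetTest helper, and since it requires every node to respond, calling it before awaitReplication avoids depending on whatever liveNodes happens to contain:

```javascript
// Sketch: make sure all nodes are up and agree on the primary before the
// replication wait, instead of trusting the liveNodes implementation detail.
rst.awaitNodesAgreeOnPrimary();
rst.awaitReplication();
```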
| Comment by Judah Schvimer [ 18/Jan/18 ] |
|
It also seems like something is likely wrong with the calling test if awaitReplication ever finds 0 secondaries. It may be worth failing if this occurs. |
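A possible guard along the lines of Judah's last comment, assuming liveNodes keeps its secondaries in a slaves array (the field name is an assumption about the ReplSetTest internals of the time); this is not existing behavior, just an illustration of failing fast:

```javascript
// Hypothetical check inside awaitReplication: fail loudly rather than silently
// proceeding when no live secondaries were found.
assert.gt(self.liveNodes.slaves.length,
          0,
          "awaitReplication found no live secondaries; the calling test is likely broken");
```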