[SERVER-49305] Remove reconfig retries in our tests Created: 02/Jul/20  Updated: 06/Dec/22  Resolved: 07/Jul/20

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Backlog - Replication Team
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-27839 Allow for step downs during reconfig ... Closed
is related to SERVER-31080 replSetReconfig may fail with NewRepl... Closed
is related to SERVER-32028 Make reconfig() in rslib.js resilient... Closed
is related to SERVER-48178 Finding self in reconfig may be inter... Closed
is related to SERVER-46541 Turn Initial Sync Semantics Automatic... Closed
is related to SERVER-27551 QuorumChecker should retry requests t... Closed
is related to SERVER-32638 Permit dblock acquisition waiting to ... Closed
is related to SERVER-40060 Retry replSetInitiate in ReplSetTest ... Closed
Assigned Teams:
Replication
Participants:

 Description   

The reconfig helpers throughout the testing infrastructure retry on various transient errors. Some of these errors are caused by automatic reconfigs that bump the config version and happen asynchronously; others are caused by connections being closed unnecessarily because of SERVER-48178. The retries were first added in SERVER-32028, SERVER-40060, SERVER-32638, SERVER-31080, SERVER-27839, and SERVER-27551, and were extended significantly in SERVER-46541.

In general, our retries are not consistent across the Python fixture's initiate, the Python fixture's reconfig, the JavaScript fixtures, and the JavaScript helpers. It's not clear whether these differences exist for a reason or simply reflect where we've historically seen BFs.
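
As a rough illustration of the retry behavior described above, and of one way to keep it consistent across helpers, here is a minimal Python sketch. It is not the code of any existing fixture: `CommandError`, `reconfig_with_retry`, and the `run_reconfig` callable are hypothetical names introduced for this example, and the set of retryable errors is simply the list of error names mentioned in this ticket.

```python
import time

# Error names taken from this ticket. The set is illustrative only; it is not
# the list that any existing fixture or helper actually uses.
RETRYABLE_RECONFIG_ERRORS = frozenset({
    "NodeNotFound",
    "ConfigurationInProgress",
    "ConfigurationNotCommittedYet",
    "NewReplicaSetConfigurationIncompatible",
    "InterruptedDueToReplStateChange",
})


class CommandError(Exception):
    """Hypothetical stand-in for a driver's command-failure exception."""

    def __init__(self, code_name, message=""):
        super().__init__(message or code_name)
        self.code_name = code_name


def reconfig_with_retry(run_reconfig, config, max_attempts=5,
                        delay_secs=0.1, sleep=time.sleep):
    """Call run_reconfig(config), retrying only on the shared retryable errors.

    run_reconfig is a hypothetical callable that issues the reconfig and
    raises CommandError on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_reconfig(config)
        except CommandError as err:
            # Re-raise anything that is not a known transient error, and give
            # up once the attempt budget is exhausted.
            if err.code_name not in RETRYABLE_RECONFIG_ERRORS or attempt == max_attempts:
                raise
            sleep(delay_secs)
```

Keeping the retryable set in one place is the point of the sketch: every helper that imports it retries on the same errors. A real helper would likely also need to re-read the current config before each retry, since an automatic reconfig may have bumped the config version in the meantime.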

Once SERVER-48178 is closed, we may be able to remove the NodeNotFound retries (see this comment for details).

We could also consider waiting for 'newlyAdded' removals before issuing reconfigs in more places, rather than just retrying on failures. That would let us stop retrying on ConfigurationInProgress, ConfigurationNotCommittedYet, and possibly NewReplicaSetConfigurationIncompatible errors. It might not be straightforward, though, since waiting for 'newlyAdded' removals may not always be possible in some testing configurations.
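
The "wait first, then reconfig" idea could look roughly like the following Python sketch. It assumes a hypothetical `get_config` callable that returns the current replica set config document with any 'newlyAdded' member fields visible; the function names, timeout, and polling interval are illustrative and not taken from any existing helper.

```python
import time


def has_newly_added(config):
    """Return True if any member document still carries a 'newlyAdded' field."""
    return any("newlyAdded" in member for member in config.get("members", []))


def wait_for_newly_added_removals(get_config, timeout_secs=60.0,
                                  interval_secs=0.5, sleep=time.sleep,
                                  clock=time.monotonic):
    """Poll get_config() until no member has 'newlyAdded'.

    Returns True once the field is gone from every member, or False if the
    timeout expires first. get_config is a hypothetical callable supplied by
    the caller.
    """
    deadline = clock() + timeout_secs
    while True:
        if not has_newly_added(get_config()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_secs)
```

Returning False on timeout instead of raising is a deliberate choice in this sketch: in configurations where the field never goes away, the caller can still fall back to issuing the reconfig and retrying on failure.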



 Comments   
Comment by Judah Schvimer [ 07/Jul/20 ]

"waiting for 'newlyAdded' removals may not always be possible in some testing configurations."

To elaborate: if, for example, a test adds a 'newlyAdded' node and then manipulates the set to keep the 'newlyAdded' field present, the test would not be able to wait for 'newlyAdded' removals. A user reconfig would still be possible in this state, since the presence of a 'newlyAdded' field does not preclude further reconfigs; it just means a concurrent reconfig could cause the user reconfig to fail. We'd have to special-case this situation.

I'm also not sure we could remove the retries on InterruptedDueToReplStateChange errors in all cases. That retry was added in SERVER-32638, and I don't think we've done anything to mitigate the underlying cause. We'd have to investigate more deeply why it was added.
