[SERVER-34666] Reduce the number of retries needed for running the retryable_writes_jscore_stepdown_passthrough.yml test suite
Created: 25/Apr/18  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.1 Desired

Type: Improvement
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Backlog - Cluster Scalability
Resolution: Unresolved
Votes: 0
Labels: RachitaD, gm-ack, open_todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-36128 ReplicationCoordinatorImpl::fillIsMas... Closed
depends on SERVER-34665 The mongo shell should retry writes o... Closed
Duplicate
is duplicated by SERVER-35225 retryOnNetworkErrors does not subtrac... Closed
Related
is related to SERVER-34608 Drivers may still see ismaster=true f... Closed
Assigned Teams:
Cluster Scalability
Participants:

 Description   

SERVER-34665 exposes a Mongo.prototype._markHostAsFailed() function that calls ReplicaSetMonitor::failedHost(). Because retargeting can now be triggered explicitly, it shouldn't be necessary to use multiple retry attempts as a way of waiting for the ReplicaSetMonitor to discover that a new primary has been elected. The auto_retry_on_network_error.js override could use this mechanism rather than setting kMaxNumRetries=3, and TestData.overrideRetryAttempts=3 could similarly be removed from the YAML suite definition.
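
A minimal sketch of the idea, runnable outside the mongo shell. Everything here other than the name _markHostAsFailed() is an assumption made for illustration: the fake connection, the isNetworkError flag, and the runWithRetargeting() helper are hypothetical, and the real function's arguments are not given in this ticket, so the stub takes none.

```javascript
// Hypothetical stand-in for a shell connection. It targets a stale primary
// until _markHostAsFailed() is called, which models forcing the
// ReplicaSetMonitor to retarget.
function makeFakeConn() {
    let targetsStalePrimary = true;
    const calls = {attempts: 0, markedFailed: 0};
    return {
        calls,
        runCommand(cmd) {
            calls.attempts++;
            if (targetsStalePrimary) {
                const err = new Error("network error while attempting to run command");
                err.isNetworkError = true;  // assumed marker, not a real shell property
                throw err;
            }
            return {ok: 1};
        },
        // Assumed to take no arguments in this sketch.
        _markHostAsFailed() {
            calls.markedFailed++;
            targetsStalePrimary = false;
        },
    };
}

// Retry on a network error, but trigger retargeting explicitly first rather
// than relying on extra attempts to give the monitor time to notice.
function runWithRetargeting(conn, cmd, numRetries) {
    for (;;) {
        try {
            return conn.runCommand(cmd);
        } catch (e) {
            if (!e.isNetworkError || numRetries === 0) {
                throw e;
            }
            --numRetries;
            conn._markHostAsFailed();
        }
    }
}

const conn = makeFakeConn();
const res = runWithRetargeting(conn, {ping: 1}, 1);
console.log(res, conn.calls);
```

In this toy model a single retry is enough because the retargeting happens before the second attempt. Whether that holds in the real suites is exactly what this ticket would need to verify.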

Note: SERVER-34608 describes a case where, after receiving an InterruptedDueToReplStateChange error response, an "isMaster" command could still observe ismaster=true, which could cause server selection to pick a node that is still in the midst of stepping down. We could avoid decrementing the numRetries counter when the error response is InterruptedDueToReplStateChange, because the first retry (i.e. the second attempt) will synchronize with the completion of the stepdown, after which the mongo shell would observe a network error. A second retry (i.e. a third attempt) would be successfully targeted at whichever node is then elected the new primary.
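
The three-attempt sequence described in the note can be sketched as follows. This is an illustration under stated assumptions, not the override's actual code: makeScriptedConn(), runWithRetries(), and the isNetworkError flag are hypothetical, and the scripted responses simply replay the sequence from the note.

```javascript
const kReplStateChange = "InterruptedDueToReplStateChange";

// Hypothetical connection that replays a fixed script of outcomes, one per
// attempt. An Error entry is thrown; anything else is returned as the reply.
function makeScriptedConn(script) {
    let attempts = 0;
    return {
        get attempts() {
            return attempts;
        },
        runCommand(cmd) {
            const step = script[attempts++];
            if (step instanceof Error) {
                throw step;
            }
            return step;
        },
    };
}

// Network errors consume one retry. An InterruptedDueToReplStateChange error
// response is retried without touching the counter.
function runWithRetries(conn, cmd, numRetries) {
    for (;;) {
        let res;
        try {
            res = conn.runCommand(cmd);
        } catch (e) {
            if (!e.isNetworkError || numRetries === 0) {
                throw e;
            }
            --numRetries;
            continue;
        }
        if (res.ok === 0 && res.codeName === kReplStateChange) {
            continue;  // not counted against numRetries
        }
        return res;
    }
}

const networkError = Object.assign(new Error("network error"), {isNetworkError: true});
const steppingDown = makeScriptedConn([
    {ok: 0, codeName: kReplStateChange},  // attempt 1: primary is stepping down
    networkError,                         // attempt 2: stepdown closes the connection
    {ok: 1, n: 1},                        // attempt 3: new primary accepts the write
]);
const result = runWithRetries(steppingDown, {insert: "coll", documents: [{_id: 1}]}, 1);
console.log(result, steppingDown.attempts);
```

With the uncounted retry, a budget of one counted retry covers all three attempts in this scripted scenario. The uncounted branch has no bound of its own here, so a real implementation would still need some cap to avoid retrying forever.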



 Comments   
Comment by Max Hirschhorn [ 13/Jul/18 ]

I'm marking this ticket as dependent on SERVER-36128 because I'd be worried that in the "terminate_primary" versions of these stepdown-like test suites we'd exhaust the number of retries too quickly.

Comment by Max Hirschhorn [ 09/Jun/18 ]

The number of retries is not decremented when any of the "continue" lines are hit: https://github.com/mongodb/mongo/blob/dea326f41fbca28ca83f881bff1591b0f95ed645/jstests/libs/override_methods/auto_retry_on_network_error.js#L348

As mentioned by judah.schvimer in SERVER-35225, we should also take care to ensure that we're respecting the number of retry attempts, assuming that forcing retargeting with Mongo.prototype._markHostAsFailed() is sufficient to ensure we don't need an infinite number of retries.

Comment by Max Hirschhorn [ 25/Apr/18 ]

I don't see a reason we actually need to shorten the election timeout when running the retryable_writes_jscore_stepdown_passthrough.yml test suite. If anything, it seems prone to causing a failover that isn't intentionally triggered by resmoke.py's StepdownThread and thus occurs at a point when a test or data consistency check isn't prepared to handle it.

Edit: We're planning to address the election timeout in SERVER-35383.
