Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: None
- Cluster Scalability
- Fully Compatible
- ALL
- ClusterScalability 16Mar-30Mar
- 0
Between every test in sharded test suites with the balancer enabled, resmoke issues a balancerStop command to the config server.
On TSAN variants and other slower variants, the config server can become briefly unavailable (for example during an election, or because of TSAN-induced lag). When this coincides with the between-test balancerStop call, the server returns FailedToSatisfyReadPreference (the primary of the config server replica set cannot be found), and since stop_balancer() has no retry logic, this failure takes down the whole job. Example: in BF-42225, banned_collection_reads_in_txn.js passed, then the between-test balancerStop call failed, and every subsequent test errored out.
Proposed fix:
Add retry logic to stop_balancer() for transient errors such as FailedToSatisfyReadPreference. We already do exactly this for _add_shard() (SERVER-121265), so the same pattern should apply here. We could also add an is_running() guard in the between-test hook path in job.py.
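For illustration, a minimal sketch of the retry pattern described above. The helper name `run_with_retries`, the retryable code list, and the attempt/delay values are all assumptions for this sketch, not the actual resmoke implementation; the real fix would live in the fixture's stop_balancer() and match whatever _add_shard() does in SERVER-121265.

```python
import time

# Assumption: transient server errors are matched by error code name,
# the same way resmoke retries _add_shard().
RETRYABLE_ERROR_CODES = {"FailedToSatisfyReadPreference", "NotWritablePrimary"}

def run_with_retries(operation, attempts=5, delay_secs=1.0):
    """Run `operation`, retrying on transient errors.

    For this sketch, `operation` is expected to raise RuntimeError whose
    message starts with the server error code name (e.g.
    "FailedToSatisfyReadPreference: ...").
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RuntimeError as err:
            code_name = str(err).split(":", 1)[0]
            # Re-raise non-retryable errors, and give up after the last attempt.
            if code_name not in RETRYABLE_ERROR_CODES or attempt == attempts:
                raise
            time.sleep(delay_secs)

# Hypothetical usage inside the fixture's stop_balancer():
#   run_with_retries(lambda: client.admin.command("balancerStop"))
```

With this in place, a brief config-server election during the between-test balancerStop would cost a few retries instead of failing every remaining test in the job.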
is related to:
- SERVER-121265 Add retries for _add_shard in ShardedClusterFixture (Closed)
- SERVER-121483 Skip balancerStop with dead nodes in shardedcluster.py teardown (Closed)
- SERVER-120863 stop_balancer fails with _OperationCancelled on slow variants due to low connectTimeoutMS (Closed)