Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: None
- Cluster Scalability
- Fully Compatible
- ALL
- ClusterScalability 16Mar-30Mar
- 0
Between every test in sharded test suites with the balancer enabled, resmoke issues a balancerStop command to the config server.
On TSAN variants and other slower variants, the config server can become briefly unavailable (for example during an election, or because of TSAN-induced lag). When this coincides with the between-test balancerStop call, the server returns FailedToSatisfyReadPreference (the primary of the config server replica set cannot be found), and since stop_balancer() has no retry logic, this failure takes down the whole job. Example: in BF-42225, banned_collection_reads_in_txn.js passed, then the between-test balancerStop call failed, and every subsequent test errored out.
Proposed fix:
Add retry logic to stop_balancer() for transient errors such as FailedToSatisfyReadPreference. We already do exactly this for _add_shard() (SERVER-121265), so the same pattern should apply here. We could also add an is_running() guard in the between-test hook path in job.py.
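For illustration, a minimal sketch of the retry pattern described above. The helper name `run_with_retries`, the retryable code list, and the attempt/delay values are all assumptions for this sketch, not the actual resmoke implementation; the real fix would live in the fixture's stop_balancer() and match whatever _add_shard() does in SERVER-121265.

```python
import time

# Assumption: transient server errors are matched by error code name,
# the same way resmoke retries _add_shard().
RETRYABLE_ERROR_CODES = {"FailedToSatisfyReadPreference", "NotWritablePrimary"}

def run_with_retries(operation, attempts=5, delay_secs=1.0):
    """Run `operation`, retrying on transient errors.

    For this sketch, `operation` is expected to raise RuntimeError whose
    message starts with the server error code name (e.g.
    "FailedToSatisfyReadPreference: ...").
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RuntimeError as err:
            code_name = str(err).split(":", 1)[0]
            # Re-raise non-retryable errors, and give up after the last attempt.
            if code_name not in RETRYABLE_ERROR_CODES or attempt == attempts:
                raise
            time.sleep(delay_secs)

# Hypothetical usage inside the fixture's stop_balancer():
#   run_with_retries(lambda: client.admin.command("balancerStop"))
```

With this in place, a brief config-server election during the between-test balancerStop would cost a few retries instead of failing every remaining test in the job.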
is related to:
- SERVER-121265 Add retries for _add_shard in ShardedClusterFixture (Closed)
- SERVER-121483 Skip balancerStop with dead nodes in shardedcluster.py teardown (Closed)
- SERVER-120863 stop_balancer fails with _OperationCancelled on slow variants due to low connectTimeoutMS (Closed)