stop_balancer in between-test hooks lacks retry logic


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • ClusterScalability 16Mar-30Mar

      Between every test in sharded test suites with the balancer enabled, resmoke issues a balancerStop command to the config server.

      On TSAN variants or other slow variants, the config server can become briefly unavailable (perhaps because of an election and/or TSAN-induced lag). When this coincides with the between-test balancerStop call, the server returns FailedToSatisfyReadPreference (no primary is found for the config server replica set), and since stop_balancer() has no retry logic, this takes out the whole job. Example: in BF-42225, banned_collection_reads_in_txn.js passed, then the between-test balancerStop call failed, and every subsequent test errored out.

      Proposed fix:
      Add retry logic to stop_balancer() for transient errors such as FailedToSatisfyReadPreference. We already do exactly this for _add_shard() (SERVER-121265) and should apply the same pattern here. We could also add an is_running() guard in the between-test hook path in job.py.
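
      A minimal sketch of what such a retry wrapper might look like, not the actual resmoke change. CommandError, RETRYABLE_CODES, and call_with_retries are hypothetical names used only for illustration, and the sketch assumes 133 is the server error code for FailedToSatisfyReadPreference. The real fix would presumably follow whatever _add_shard() does.

```python
import time

# Assumption: 133 is the server error code for FailedToSatisfyReadPreference.
RETRYABLE_CODES = frozenset({133})


class CommandError(Exception):
    """Hypothetical stand-in for a driver error that carries a server error code."""

    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


def call_with_retries(operation, attempts=5, base_delay_secs=0.5, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff.

    Errors whose code is not in RETRYABLE_CODES are re-raised immediately,
    and the last error is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except CommandError as err:
            if err.code not in RETRYABLE_CODES or attempt == attempts:
                raise
            # Back off before the next attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay_secs * 2 ** (attempt - 1))
```

      Under these assumptions, stop_balancer() would pass the function that issues balancerStop as the operation, so a brief loss of the config server primary costs a few seconds of waiting rather than the whole job.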

            Assignee:
            Rehan Gill
            Reporter:
            Rehan Gill
