stop_balancer fails with _OperationCancelled on slow variants due to low connectTimeoutMS


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.3.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ClusterScalability 2Mar-16Mar

      On slow build variants such as TSAN debug, the stop_balancer() call in ShardedClusterFixture fails prematurely with pymongo.errors._OperationCancelled, even though the server-side balancerStop command eventually succeeds. stop_balancer() creates a MongoClient with the default connectTimeoutMS of 30 seconds but passes maxTimeMS=300000 (5 minutes) to the server command. Under TSAN, localhost round-trip times for hello/heartbeat commands can exceed 30 seconds. When the PyMongo SDAM monitor's streaming hello exceeds the 30-second socket timeout (derived from connectTimeoutMS), the monitor resets the connection pool with interrupt_connections=True, which cancels the in-flight balancerStop command and raises _OperationCancelled.

      We should raise connectTimeoutMS to match maxTimeMS.
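
A minimal sketch of the proposed change, assuming the fix amounts to passing the same 300000 ms value to both the client and the command. The helper names (client_options, balancer_stop_command) and the uri parameter are hypothetical and are not taken from the resmoke fixture code; only MongoClient, connectTimeoutMS, balancerStop, and maxTimeMS come from the description above.

```python
# Hypothetical constant: the 5-minute maxTimeMS mentioned in the ticket.
BALANCER_STOP_TIMEOUT_MS = 300_000


def client_options(timeout_ms=BALANCER_STOP_TIMEOUT_MS):
    """Return MongoClient options whose connect timeout matches maxTimeMS."""
    return {"connectTimeoutMS": timeout_ms}


def balancer_stop_command(timeout_ms=BALANCER_STOP_TIMEOUT_MS):
    """Return the balancerStop command document with a server-side time limit."""
    # The command name must be the first key of the document.
    return {"balancerStop": 1, "maxTimeMS": timeout_ms}


def stop_balancer(uri, timeout_ms=BALANCER_STOP_TIMEOUT_MS):
    """Stop the balancer through a mongos reachable at `uri` (assumed)."""
    # Imported lazily so the helpers above can be used without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri, **client_options(timeout_ms))
    try:
        # With the two timeouts aligned, a slow monitor hello is less likely
        # to reset the pool while balancerStop is still in flight.
        return client.admin.command(balancer_stop_command(timeout_ms))
    finally:
        client.close()
```

Keeping the two values in a single constant means the client-side timeout cannot drift below the server-side limit if one of them is changed later.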

            Assignee:
            Abdul Qadeer
            Reporter:
            Abdul Qadeer
