Overload retargeting prose tests do not ensure that secondaries are discovered

    • Type: Build Failure
    • Resolution: Fixed
    • Priority: Unknown
    • 4.17.0
    • Affects Version/s: None
    • Component/s: None
    • Python Drivers
    • Not Needed

      Name of Failure: 

      test.asynchronous.test_retryable_reads.TestRetryableReads.test_03_01_retryable_reads_caused_by_overload_errors_are_retried_on_a_different_replicaset_server_when_one_is_available_and_overload_retargeting_is_enabled

      Link to task:

      https://spruce.corp.mongodb.com/task/mongo_python_driver_test_macos_arm64_test_standard_latest_python3.12_async_noauth_ssl_replica_set_cov_f31ba09713ff61086f5ebddb43a1e79b419b9802_26_04_15_18_42_29/tests?execution=0&sorts=STATUS%3AASC

      Context of when and why the failure occurred:

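      The overload retargeting prose tests do not wait for all nodes, including secondaries, to be discovered before the test operation begins. The linked test can only pass if at least one secondary is already known to the client when the `find` is retried, so the result depends on a race between server discovery and the test operation. When discovery loses the race, the retry presumably has no other server to select and lands on the server that just failed, which is why both events in the trace below report ('localhost', 27018).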
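
One way to close this kind of race is to poll until the client reports at least one secondary before running the test operation. The following is a minimal sketch under stated assumptions, not the change that was actually made: `wait_for_secondaries` and `FakeTopology` are hypothetical names introduced here, and `FakeTopology` stands in for a real client so the example runs without a replica set.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_secondaries(
    get_secondaries: Callable[[], Awaitable[set]],
    min_count: int = 1,
    timeout: float = 10.0,
    interval: float = 0.05,
) -> set:
    """Poll until at least `min_count` secondaries are known, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        secondaries = await get_secondaries()
        if len(secondaries) >= min_count:
            return secondaries
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(secondaries)} secondaries discovered after {timeout}s"
            )
        await asyncio.sleep(interval)


class FakeTopology:
    """Hypothetical stand-in for a client's view of the replica set.

    Reports no secondaries for the first `discover_after` polls, then one.
    """

    def __init__(self, discover_after: int) -> None:
        self._calls = 0
        self._discover_after = discover_after

    async def secondaries(self) -> set:
        self._calls += 1
        if self._calls <= self._discover_after:
            return set()
        return {("localhost", 27019)}


async def main() -> None:
    topology = FakeTopology(discover_after=3)
    # Block the "test" until discovery has caught up.
    found = await wait_for_secondaries(topology.secondaries, interval=0.001)
    print("secondaries discovered:", found)


if __name__ == "__main__":
    asyncio.run(main())
```

In the real test, the callable would presumably wrap whatever the client exposes for its known secondaries, and the wait would run between creating `client` and executing the `find`. How the actual fix does this is not shown in this ticket.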

      Stack trace:

      
      
      [2026/04/15 12:40:23.833] FAILURE: assert ('localhost', 27018) != ('localhost', 27018)
      [2026/04/15 12:40:23.833]  +  where ('localhost', 27018) = <CommandFailedEvent ('localhost', 27018) db: 't', command: 'find', operation_id: 1978461562, duration_micros: 412, failure: {'errorLabels': ['RetryableError', 'SystemOverloadedError'], 'ok': 0.0, 'errmsg': "Failing command via 'failCommand' failpoint", 'code': 6, 'codeName': 'HostUnreachable', '$clusterTime': {'clusterTime': Timestamp(1776281717, 20), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1776281717, 20)}, service_id: None, server_connection_id: 3871>.connection_id
      [2026/04/15 12:40:23.833]  +  and   ('localhost', 27018) = <CommandSucceededEvent ('localhost', 27018) db: 't', command: 'find', operation_id: 11808384, duration_micros: 458, service_id: None, server_connection_id: 3871>.connection_id ()
      [2026/04/15 12:40:23.833] self = <test.asynchronous.test_retryable_reads.TestRetryableReads testMethod=test_03_01_retryable_reads_caused_by_overload_errors_are_retried_on_a_different_replicaset_server_when_one_is_available_and_overload_retargeting_is_enabled>
      [2026/04/15 12:40:23.833]     @async_client_context.require_replica_set
      [2026/04/15 12:40:23.833]     @async_client_context.require_secondaries_count(1)
      [2026/04/15 12:40:23.833]     @async_client_context.require_failCommand_fail_point
      [2026/04/15 12:40:23.833]     @async_client_context.require_version_min(4, 4, 0)
      [2026/04/15 12:40:23.833]     async def test_03_01_retryable_reads_caused_by_overload_errors_are_retried_on_a_different_replicaset_server_when_one_is_available_and_overload_retargeting_is_enabled(
      [2026/04/15 12:40:23.833]         self
      [2026/04/15 12:40:23.833]     ):
      [2026/04/15 12:40:23.833]         listener = OvertCommandListener()
      [2026/04/15 12:40:23.833]     
      [2026/04/15 12:40:23.833]         # 1. Create a client `client` with `retryReads=true`, `readPreference=primaryPreferred`, `enableOverloadRetargeting=True`, and command event monitoring enabled.
      [2026/04/15 12:40:23.833]         client = await self.async_rs_or_single_client(
      [2026/04/15 12:40:23.833]             event_listeners=[listener],
      [2026/04/15 12:40:23.833]             retryReads=True,
      [2026/04/15 12:40:23.833]             readPreference="primaryPreferred",
      [2026/04/15 12:40:23.833]             enableOverloadRetargeting=True,
      [2026/04/15 12:40:23.833]         )
      [2026/04/15 12:40:23.833]     
      [2026/04/15 12:40:23.833]         # 2. Configure a fail point with the RetryableError and SystemOverloadedError error labels.
      [2026/04/15 12:40:23.833]         command_args = {
      [2026/04/15 12:40:23.833]             "configureFailPoint": "failCommand",
      [2026/04/15 12:40:23.833]             "mode": {"times": 1},
      [2026/04/15 12:40:23.833]             "data": {
      [2026/04/15 12:40:23.833]                 "failCommands": ["find"],
      [2026/04/15 12:40:23.833]                 "errorLabels": ["RetryableError", "SystemOverloadedError"],
      [2026/04/15 12:40:23.833]                 "errorCode": 6,
      [2026/04/15 12:40:23.833]             },
      [2026/04/15 12:40:23.833]         }
      [2026/04/15 12:40:23.833]         await async_set_fail_point(client, command_args)
      [2026/04/15 12:40:23.833]     
      [2026/04/15 12:40:23.833]         # 3. Reset the command event monitor to clear the fail point command from its stored events.
      [2026/04/15 12:40:23.833]         listener.reset()
      [2026/04/15 12:40:23.833]     
      [2026/04/15 12:40:23.833]         # 4. Execute a `find` command with `client`.
      [2026/04/15 12:40:23.833]         await client.t.t.find_one({})
      [2026/04/15 12:40:23.833]     
      [2026/04/15 12:40:23.833]         # 5. Assert that one failed command event and one successful command event occurred.
      [2026/04/15 12:40:23.833]         self.assertEqual(len(listener.failed_events), 1)
      [2026/04/15 12:40:23.833]         self.assertEqual(len(listener.succeeded_events), 1)
      [2026/04/15 12:40:23.833]     
      [2026/04/15 12:40:23.833]         # 6. Assert that both events occurred on different servers.
      [2026/04/15 12:40:23.833] >       assert listener.failed_events[0].connection_id != listener.succeeded_events[0].connection_id
      [2026/04/15 12:40:23.833] E       assert ('localhost', 27018) != ('localhost', 27018)
      [2026/04/15 12:40:23.833] E        +  where ('localhost', 27018) = <CommandFailedEvent ('localhost', 27018) db: 't', command: 'find', operation_id: 1978461562, duration_micros: 412, failure: {'errorLabels': ['RetryableError', 'SystemOverloadedError'], 'ok': 0.0, 'errmsg': "Failing command via 'failCommand' failpoint", 'code': 6, 'codeName': 'HostUnreachable', '$clusterTime': {'clusterTime': Timestamp(1776281717, 20), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1776281717, 20)}, service_id: None, server_connection_id: 3871>.connection_id
      [2026/04/15 12:40:23.833] E        +  and   ('localhost', 27018) = <CommandSucceededEvent ('localhost', 27018) db: 't', command: 'find', operation_id: 11808384, duration_micros: 458, service_id: None, server_connection_id: 3871>.connection_id
      [2026/04/15 12:40:23.833] test/asynchronous/test_retryable_reads.py:321: AssertionError
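
The assertion failure above is consistent with a retry that prefers a different server but falls back to the failed one when no other server is known. The toy model below illustrates that reading only; `select_for_retry` is a hypothetical function written for this note, it is not the driver's server selection code, and the port 27019 address is made up for the example.

```python
from typing import List, Tuple

Address = Tuple[str, int]


def select_for_retry(known: List[Address], failed: Address) -> Address:
    """Prefer any known server other than `failed`; fall back to `failed`."""
    candidates = [addr for addr in known if addr != failed]
    return candidates[0] if candidates else failed


failed = ("localhost", 27018)

# Secondary not yet discovered: the retry lands on the same server,
# matching the failing assertion in the trace.
same = select_for_retry([("localhost", 27018)], failed)

# Secondary discovered: the retry moves to a different server.
other = select_for_retry([("localhost", 27018), ("localhost", 27019)], failed)

print(same, other)
```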
      

            Assignee:
            Noah Stapp
            Reporter:
            Noah Stapp
