- Type: Bug
- Resolution: Unresolved
- Priority: Unknown
- Affects Version/s: None
- Component/s: None
There are two flaky tests in discovery_and_monitoring.py:
- Error returned from connection pool clear with interruptInUseConnections=true is retryable
- Error returned from connection pool clear with interruptInUseConnections=true is retryable for write
The tests fail with an `_OperationCancelled` error. Stack trace below:
```
FAILURE: pymongo.errors._OperationCancelled: operation cancelled ()

self = <test.asynchronous.test_discovery_and_monitoring.TestUnifiedInterruptInUsePoolClear testMethod=test_Error_returned_from_connection_pool_clear_with_interruptInUseConnections=true_is_retryable>

    async def test_case(self):
>       await self.run_scenario(spec)

test\asynchronous\unified_format.py:1469:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test\asynchronous\unified_format.py:1411: in run_scenario
    await self._run_scenario(spec, uri)
test\asynchronous\unified_format.py:1446: in _run_scenario
    await self.run_operations(spec["operations"])
test\asynchronous\unified_format.py:1245: in run_operations
    await self.run_special_operation(op)
test\asynchronous\unified_format.py:1238: in run_special_operation
    await method(spec["arguments"])
test\asynchronous\unified_format.py:1188: in _testOperation_waitForThread
    raise thread.exc
test\asynchronous\utils_spec_runner.py:85: in run
    await work()
test\asynchronous\unified_format.py:982: in run_entity_operation
    result = await cmd(**dict(arguments))
test\asynchronous\unified_format.py:785: in _collectionOperation_find
    return await find_cursor.to_list()
pymongo\_csot.py:109: in csot_wrapper
    return await func(self, *args, **kwargs)
pymongo\asynchronous\cursor.py:1318: in to_list
    if not await self._next_batch(res, remaining):
pymongo\asynchronous\cursor.py:1274: in _next_batch
    if len(self._data) or await self._refresh():
pymongo\asynchronous\cursor.py:1210: in _refresh
    await self._send_message(q)
pymongo\asynchronous\cursor.py:1104: in _send_message
    response = await client._run_operation(
pymongo\_csot.py:109: in csot_wrapper
    return await func(self, *args, **kwargs)
pymongo\asynchronous\mongo_client.py:1763: in _run_operation
    return await self._retryable_read(
pymongo\asynchronous\mongo_client.py:1872: in _retryable_read
    return await self._retry_internal(
pymongo\_csot.py:109: in csot_wrapper
    return await func(self, *args, **kwargs)
pymongo\asynchronous\mongo_client.py:1828: in _retry_internal
    return await _ClientConnectionRetryable(
pymongo\asynchronous\mongo_client.py:2583: in run
    return await self._read() if self._is_read else await self._write()
pymongo\asynchronous\mongo_client.py:2739: in _read
    return await self._func(self._session, self._server, conn, read_pref)  # type: ignore
pymongo\asynchronous\mongo_client.py:1754: in _cmd
    return await server.run_operation(
pymongo\asynchronous\helpers.py:47: in inner
    return await func(*args, **kwargs)
pymongo\asynchronous\server.py:208: in run_operation
    reply = await conn.receive_message(request_id)
pymongo\asynchronous\pool.py:595: in receive_message
    self._raise_connection_failure(error)
pymongo\asynchronous\pool.py:591: in receive_message
    return await receive_message(self, request_id, self.max_message_size)
pymongo\asynchronous\network.py:324: in receive_message
    length, _, response_to, op_code = _UNPACK_HEADER(await async_receive_data(conn, 16, deadline))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

conn = AsyncConnection(<ssl.SSLSocket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0>) CLOSED at 2045221770000
length = 16, deadline = None

    async def async_receive_data(
        conn: AsyncConnection, length: int, deadline: Optional[float]
    ) -> memoryview:
        sock = conn.conn
        sock_timeout = sock.gettimeout()
        timeout: Optional[Union[float, int]]
        if deadline:
            # When the timeout has expired perform one final check to
            # see if the socket is readable. This helps avoid spurious
            # timeouts on AWS Lambda and other FaaS environments.
            timeout = max(deadline - time.monotonic(), 0)
        else:
            timeout = sock_timeout

        sock.settimeout(0.0)
        loop = asyncio.get_running_loop()
        cancellation_task = create_task(_poll_cancellation(conn))
        try:
            if _HAVE_SSL and isinstance(sock, (SSLSocket, _sslConn)):
                read_task = create_task(_async_receive_ssl(sock, length, loop))  # type: ignore[arg-type]
            else:
                read_task = create_task(_async_receive(sock, length, loop))  # type: ignore[arg-type]
            tasks = [read_task, cancellation_task]
            try:
                done, pending = await asyncio.wait(
                    tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
                )
                for task in pending:
                    task.cancel()
                if pending:
                    await asyncio.wait(pending)
                if len(done) == 0:
                    raise socket.timeout("timed out")
                if read_task in done:
                    return read_task.result()
                print("RIGHT BEFORE OPERATION CANCELLED IS RAISED")
                print(cancellation_task in done)
>               raise _OperationCancelled("operation cancelled")
E               pymongo.errors._OperationCancelled: operation cancelled

pymongo\network_layer.py:287: _OperationCancelled
```
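The tail of the traceback shows the mechanism: `async_receive_data` races the socket read against a cancellation watcher, and the watcher wins when the pool is cleared with interruptInUseConnections=true. Below is a distilled sketch of that race for readers skimming the report; it is not pymongo's actual helper, and `receive_or_cancel` plus the `asyncio.Event` flag are stand-ins for `_poll_cancellation` and the connection's cancel context:

```python
import asyncio


async def receive_or_cancel(read_coro, cancelled: asyncio.Event):
    # Simplified version of the race in async_receive_data: the network read
    # competes with a watcher for the connection's cancellation flag. If the
    # watcher finishes first (e.g. the pool was cleared with
    # interruptInUseConnections=true), the read is abandoned and the caller
    # sees an "operation cancelled" error, even if bytes were about to arrive.
    read_task = asyncio.create_task(read_coro)
    cancel_task = asyncio.create_task(cancelled.wait())
    done, pending = await asyncio.wait(
        {read_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    if pending:
        await asyncio.wait(pending)
    if read_task in done:
        return read_task.result()
    raise RuntimeError("operation cancelled")  # pymongo raises _OperationCancelled here
```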
The test expects the following order of events (omitting some events at the beginning that appear to behave correctly): configure the fail point, start the find command, the find command fails, start the find command again (the retry), and the find command succeeds.
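For context, the spec test drives this sequence with a server fail point. The exact document lives in the spec JSON and is not reproduced here; a rough, illustrative sketch (the `configure_fail_point` helper and the specific values are my own, not the spec's) looks like:

```python
from pymongo import AsyncMongoClient


async def configure_fail_point(client: AsyncMongoClient) -> None:
    # Illustrative failCommand fail point: make the next "find" fail with a
    # retryable error (91 = ShutdownInProgress). The real spec file sets
    # additional fields (e.g. appName) that are omitted here.
    await client.admin.command(
        {
            "configureFailPoint": "failCommand",
            "mode": {"times": 1},
            "data": {"failCommands": ["find"], "errorCode": 91},
        }
    )
```

With a fail point like this in place, the find is expected to fail once with the retryable error and then succeed on the retry.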
So far, these tests have only been observed to be flaky on Windows with Python 3.9 against newer server versions (mainly >7.0).
The first find command does fail as expected; however, the second find also fails with the same `_OperationCancelled` error.
Prior to the `_OperationCancelled`, there is a heartbeat failure that causes the pool to be cleared, which in turn cancels the in-progress operation. The test expects all of the failPoints to be consumed before the retry. However, lowering the failPoint `times` still resulted in the same error.
Notably, there was a heartbeat failure, followed by a heartbeat success, and then another heartbeat failure on the same connection.
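One way to observe this heartbeat sequence outside the unified runner is pymongo's monitoring API; the `HeartbeatLogger` class below is just an illustrative listener, not part of the test suite:

```python
from pymongo import monitoring


class HeartbeatLogger(monitoring.ServerHeartbeatListener):
    """Log each heartbeat outcome along with the connection it ran on."""

    def started(self, event: monitoring.ServerHeartbeatStartedEvent) -> None:
        print(f"heartbeat started on {event.connection_id}")

    def succeeded(self, event: monitoring.ServerHeartbeatSucceededEvent) -> None:
        print(f"heartbeat succeeded on {event.connection_id}")

    def failed(self, event: monitoring.ServerHeartbeatFailedEvent) -> None:
        print(f"heartbeat FAILED on {event.connection_id}: {event.reply!r}")


# Passed to the client under test, e.g.:
# client = AsyncMongoClient(uri, event_listeners=[HeartbeatLogger()])
```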
Additionally, I've attached some log files:
- output.log is the original test log
- output_failed2.log contains the test logs when the failPoint is configured to fail twice.
- is related to PYTHON-4493 Use asyncio protocols instead of sockets for network IO (Closed)
- related to PYTHON-5174 Test failure - test_Driver_extends_timeout_while_streaming (Investigating)