- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Distributed Query Execution
- Query Execution
- Fully Compatible
- ALL
- QE 2025-11-24
BF-40621 contains a test failure in jstests/concurrency/fsm_workloads/query/agg/agg_unionWith_interrupt_cleanup.js in which a cursor remained open for 10 minutes during the test's teardown.
The test's teardown schedules one kill cursors operation for each cursor that is still open, and then repeatedly polls the list of open cursors until none remain. If the poll loop still finds at least one open cursor after 10 minutes, the test fails. This is what happened in BF-40621.
The logs indicate that the cursor was created by the test itself, before the teardown. It is unclear why the cursor was not initially killed in the teardown function, but some code comments in the test mention that there can be race conditions or network blips in which the kill operations won't arrive as expected.
To make this test more stable, we should add a retry mechanism to the poll loop in the teardown function: whenever the loop detects that cursors are still open, it should try to kill them again. This should make the teardown more reliable in the face of transient network issues or races.
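A minimal sketch of the proposed retry, not the actual test code: `killCursorsWithRetry`, `listOpenCursors`, `killCursor`, `maxAttempts`, and `sleepFn` are hypothetical names, and the two callbacks stand in for whatever the test uses to enumerate and kill cursors. The point is only that the kill is re-issued on every poll iteration instead of once up front.

```javascript
// Hypothetical helper. `listOpenCursors` returns an array of open cursor ids;
// `killCursor` attempts to kill one cursor and may throw on a transient error.
function killCursorsWithRetry(listOpenCursors, killCursor,
                              {maxAttempts = 600, sleepFn = () => {}} = {}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const open = listOpenCursors();
        if (open.length === 0) {
            return attempt;  // Number of polls needed to observe zero cursors.
        }
        // Re-issue the kill for every cursor that is still open.
        for (const cursorId of open) {
            try {
                killCursor(cursorId);
            } catch (e) {
                // Swallow the error; the cursor is retried on the next poll.
            }
        }
        sleepFn();  // Assumed pause between polls; a no-op by default.
    }
    throw new Error("cursors still open after " + maxAttempts + " attempts");
}

// Simulated usage: the first kill attempt fails, later ones succeed.
const open = new Set([101, 102]);
let killCalls = 0;
const flakyKill = (id) => {
    killCalls++;
    if (killCalls === 1) {
        throw new Error("simulated network blip");
    }
    open.delete(id);
};
const polls = killCursorsWithRetry(() => [...open], flakyKill);
```

In the simulation, the cursor whose first kill was lost is killed on the second poll, so the loop ends without hitting the attempt limit. In the real test, the overall timeout would presumably stay in place as the failure condition.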