-
Type:
Bug
-
Resolution: Fixed
-
Priority:
Major - P3
-
Affects Version/s: None
-
Component/s: Distributed Query Execution
-
None
-
Query Execution
-
Fully Compatible
-
ALL
-
QE 2025-02-03
-
200
-
None
-
None
-
None
-
None
-
None
-
None
-
None
The error handling in the AsyncResultsMerger currently is inconsistent.
The AsyncResultsMerger maintains a _status member, which can only transition from OK to non-OK. It also maintains a separate status member on every remote that it manages.
The status instance member is _updated in the _ready() member function if any of the remotes has its status member to set to any error.
_status is then checked in nextReady(). If _status contains an error, the nextReady() method correctly returns it.
The individual remotes' status member is currently updated in multiple locations in the code. Some locations afterwards call the _cleanUpFailedBatch() method, which can then reset the remote's status from non-OK to OK if partial results are allowed.
But some locations currently do not call the cleanup method.
Instead the _scheduleGetMores() method could call _cleanUpFailedBatch() when more data needed to be fetched from a remote. But it would then return and not send any getMores.
The whole approach has multiple issues.
1. Cleanup of errors is currently inconsistent. Some locations call the _cleanUpFailedBatch() method if a remote runs into an error. Some other locations don't.
2. Cleanup of errors can currently be executed when scheduling getMore requests in _scheduleGetMores(), but that function returns early even if the error was cleaned up.
3. The issue uncovered by BF-36336, which is described in more detail following.
Clients normally use the AsyncResultsMerger like this:
while (!arm.ready()) { // block until results are ready. ... } nextResult = arm.nextReady();
This works fine if the sequence of events is:
1. An error response for a remote is received, which sets the remote's status value to that error.
2. User calls ready(). This iterates over all remotes, finds the error and sets the global _status to that error as well. It also returns true so that the user can call nextReady().
3. User calls nextReady(), which checks _status and returns the error correctly.
However, there is the potential for a race if a remote returns an error between the calls to ready() and nextReady(). If this happens, ready() returns true but does not update _status. nextReady() then checks _status, which is still set to OK. It then calls into the appropriate subfunction to return the next result, which expect the remote not to have an error status set. These functions will then tassert or fail with invariant failures.
The sequence of events which currently does not work is:
1. User calls ready(). Ready returns true because there is some data available. _status is still OK.
2. A response for a remote is received, which sets the remote's status value to that error.
3. User calls nextReady(), which checks _status (not yet set) and calls into the appropriate subfunction to retrieve the next result. This triggers a tassert or invariant failure, because it assumes that the remote does not have an error set in its status variable.
The error handling in the AsyncResultsMerger should be improved to address all 3 issues mentioned above.