- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Fully Compatible
- ALL
- ClusterScalability 8Jun-22June
CursorEstablisher::_prioritizeFailures decides which failure to surface when multiple remotes fail with distinct errors. It can return a CallbackCanceled ("Baton wait canceled"), which is produced when a retry backoff sleep is cancelled, in preference to a recoverable retargeting error such as StaleConfig.
This is the same class of bug as SERVER-123598, which stopped CallbackCanceled from clobbering a stored error, but that fix deliberately excluded retargeting errors (&& !_currentErrorIsRetargetingError()), leaving them exposed.
Example sequence:
- CursorEstablisher sends two requests
- First request is rejected by the shard's rate limiter (SystemOverloadedError), inducing a backoff sleep
- Second request fails with StaleConfig, which is stored and triggers stopRetrying()
- stopRetrying() cancels the first request's backoff, yielding CallbackCanceled
- Final error returned is CallbackCanceled instead of StaleConfig
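The sequence above can be sketched as a toy model. This is only an illustration under stated assumptions, not the server code: `prioritize` and `run_example_sequence` are hypothetical names, error codes are modeled as plain strings, and the rule that any other incoming error replaces the stored one is an assumption made to keep the model small.

```python
from typing import Optional

# Assumption: only the retargeting error named in this ticket is modeled.
RETARGETING_ERRORS = {"StaleConfig"}


def prioritize(current: Optional[str], incoming: str) -> str:
    """Toy model of the prioritization described in this ticket."""
    if current is None:
        return incoming
    current_is_retargeting = current in RETARGETING_ERRORS
    # Guard as described for SERVER-123598: keep the stored error over a
    # CallbackCanceled, except when the stored error is a retargeting error.
    if incoming == "CallbackCanceled" and not current_is_retargeting:
        return current
    # Assumption: in every other case the incoming error replaces the stored one.
    return incoming


def run_example_sequence() -> str:
    stored: Optional[str] = None
    # Request 1: SystemOverloadedError starts a backoff sleep. Nothing is
    # stored yet because the request is still going to be retried.
    backoff_pending = True
    # Request 2: StaleConfig is stored and retries are stopped.
    stored = prioritize(stored, "StaleConfig")
    # Stopping retries cancels the pending backoff, which reports
    # CallbackCanceled for request 1.
    if backoff_pending:
        stored = prioritize(stored, "CallbackCanceled")
    return stored


print(run_example_sequence())  # CallbackCanceled
```

In this model the guard protects a stored non-retargeting error but lets the cancellation replace StaleConfig, which matches the final step of the sequence.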
Because CallbackCanceled is in the CancellationError category (not RetriableError), the recoverable StaleConfig refresh-and-retry path is lost and the operation fails with a non-retriable error. This was observed in the ingress-rate-limiter passthrough suites, where overload retries with backoff are common. It surfaced to clients as, for example, "Executor error during getMore :: caused by :: Baton wait canceled" on $lookup getMores.
A possible fix is to always prefer an existing error over a CallbackCanceled error, i.e. remove the !_currentErrorIsRetargetingError() exception.
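A minimal sketch of the proposed rule, again as a toy model and not the actual patch: `prioritize_fixed` is a hypothetical name, error codes are plain strings, and the handling of incoming errors other than CallbackCanceled is an assumption that is outside the scope of this ticket.

```python
from typing import Optional


def prioritize_fixed(current: Optional[str], incoming: str) -> str:
    """Toy model: a stored error always wins over CallbackCanceled."""
    if current is None:
        return incoming
    if incoming == "CallbackCanceled":
        # No exception for retargeting errors: the stored error is kept.
        return current
    # Assumption: other incoming errors replace the stored one, as before.
    return incoming


# StaleConfig now survives the cancellation of the other request's backoff.
print(prioritize_fixed("StaleConfig", "CallbackCanceled"))  # StaleConfig
```

With this rule, CallbackCanceled is only surfaced when no other error has been stored, so the StaleConfig refresh-and-retry path stays available in the example sequence.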
- blocks
  - SERVER-114130 Move the failRateLimiting fail point to the ingress request rate limiter in session workflow (Closed)
- is related to
  - SERVER-123598 CursorEstablisher returns retry cancellation over original failure (Closed)