Type: Task
Resolution: Unresolved
Priority: Unknown
Component/s: Retryability
Summary
Make system overload errors easier to diagnose. When the system overload retry loop decides to short-circuit a retry or hits a non-retryable error, it should be easy to diagnose why that decision was made. Ideally, we could answer the following questions:
- Was the last error retryable or non-retryable, and why? For example, did the mongos return an overload error without the retryable errorLabel?
- If the last error was retryable, why was a retry not performed?
- Was the retry budget depleted (DRIVERS-3240)? Did we hit the max retry attempts?
- How long did the failed operation take (including all retries)?
Motivation
Who is the affected end user?
Any user or engineer. This kind of debugging info will help diagnose support cases in which users encounter system overload errors.
How does this affect the end user?
Without this, users will see a generic "SystemOverload" error and will not be able to determine whether that error came from the initial attempt or from a retry, or why another retry was not attempted.
How likely is it that this problem or use case will occur?
Common.
If the problem does occur, what are the consequences and how severe are they?
Delays support cases.
Is this issue urgent?
Initially, I will include it as a goal for DRIVERS-3160, but it can also be completed as a follow-up change.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No.
Acceptance Criteria
Overload errors returned by the driver should include actionable information that can be used to answer the questions above.
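As a rough illustration of what "actionable information" could look like, here is a minimal Python sketch of a retry loop that records why it stopped. Nothing here is an existing driver API: `ServerError`, `OverloadError`, `StopReason`, `TokenBudget`, `run_with_retries`, and the `"RetryableError"` label string are all hypothetical names chosen for this example, and backoff between attempts is omitted for brevity.

```python
import time
from enum import Enum

# Assumption: placeholder label name; the real label is defined by the spec work.
RETRYABLE_LABEL = "RetryableError"


class StopReason(Enum):
    NON_RETRYABLE_ERROR = "last error was not retryable"
    MAX_ATTEMPTS_REACHED = "maximum retry attempts reached"
    RETRY_BUDGET_DEPLETED = "retry budget depleted"


class ServerError(Exception):
    """Hypothetical stand-in for an error returned by the server."""

    def __init__(self, message, labels=()):
        super().__init__(message)
        self.labels = frozenset(labels)


class OverloadError(Exception):
    """Hypothetical driver error carrying the diagnostics listed in the Summary."""

    def __init__(self, last_error, stop_reason, attempts, elapsed_seconds):
        self.last_error = last_error            # what the final attempt returned
        self.stop_reason = stop_reason          # why no further retry was made
        self.attempts = attempts                # initial attempt plus retries
        self.elapsed_seconds = elapsed_seconds  # total time including retries
        super().__init__(
            f"{stop_reason.value} after {attempts} attempt(s) "
            f"in {elapsed_seconds:.3f}s; last error: {last_error}"
        )


class TokenBudget:
    """Hypothetical, greatly simplified retry budget (see DRIVERS-3240)."""

    def __init__(self, tokens):
        self.tokens = tokens

    def try_consume(self):
        if self.tokens <= 0:
            return False
        self.tokens -= 1
        return True


def run_with_retries(operation, max_retries=2, budget=None, clock=time.monotonic):
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except ServerError as err:
            if RETRYABLE_LABEL not in err.labels:
                reason = StopReason.NON_RETRYABLE_ERROR
            elif attempts > max_retries:
                reason = StopReason.MAX_ATTEMPTS_REACHED
            elif budget is not None and not budget.try_consume():
                reason = StopReason.RETRY_BUDGET_DEPLETED
            else:
                continue  # retry is allowed; a real loop would back off here
            raise OverloadError(err, reason, attempts, clock() - start) from err
```

With an error shaped like this, a support engineer could read `stop_reason`, `attempts`, and `elapsed_seconds` directly from the exception (or its message) instead of inferring them from logs.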
is related to
- DRIVERS-3240 Adaptive token bucket retry policy