Make system overload errors easier to diagnose

    • Type: Task
    • Resolution: Unresolved
    • Priority: Unknown
    • Component/s: Retryability

      Summary

      Make system overload errors easier to diagnose. When the system overload retry loop short-circuits a retry or hits a non-retryable error, it should be easy to diagnose why that decision was made. Ideally, the error would let us answer the following questions:

      1. Was the last error retryable or non-retryable, and why? For example, did mongos return an overload error without the retryable errorLabel?
      2. If the last error was retryable, why was a retry not performed?
      3. Was the retry budget depleted (DRIVERS-3240)? Did we hit the maximum number of retry attempts?
      4. How long did the failed operation take (including all retries)?
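
      As an illustration only, the four questions above could be answered by a small diagnostic record attached to the final error. The sketch below is hypothetical: the names OverloadDiagnostics, StopReason, and summary(), the field layout, and the label string "RetryableError" are all assumptions made for this example, not an existing driver API or a settled spec decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Assumed name of the error label that marks an error as retryable.
RETRYABLE_LABEL = "RetryableError"


class StopReason(Enum):
    """Hypothetical reasons the retry loop gave up (questions 2 and 3)."""
    NON_RETRYABLE_ERROR = "last error was not retryable"
    MAX_RETRIES_REACHED = "maximum retry attempts reached"
    RETRY_BUDGET_DEPLETED = "retry budget depleted"


@dataclass
class OverloadDiagnostics:
    """Hypothetical record a driver could attach to an overload error."""
    error_labels: List[str]    # labels on the last error (question 1)
    stop_reason: StopReason    # recorded by the retry loop when it stops
    attempts: int              # total attempts, including the initial one
    max_retries: int           # configured retry limit
    total_duration_ms: float   # wall time across all attempts (question 4)

    @property
    def retryable(self) -> bool:
        # The last error counts as retryable only if it carries the label.
        return RETRYABLE_LABEL in self.error_labels

    def summary(self) -> str:
        # One line that a support engineer can read directly from a log.
        retries = self.attempts - 1
        return (
            f"retryable={self.retryable} (labels={self.error_labels}); "
            f"stopped because {self.stop_reason.value}; "
            f"{retries}/{self.max_retries} retries used; "
            f"total duration {self.total_duration_ms:.0f} ms"
        )


if __name__ == "__main__":
    # Example: the first attempt failed with an overload error that
    # lacked the retryable label, so no retry was attempted.
    diag = OverloadDiagnostics(
        error_labels=["SystemOverloadedError"],
        stop_reason=StopReason.NON_RETRYABLE_ERROR,
        attempts=1,
        max_retries=5,
        total_duration_ms=12.4,
    )
    print(diag.summary())
```

      Having the retry loop record the stop reason explicitly, rather than inferring it afterwards from counters, keeps the answer to "why was a retry not performed?" unambiguous.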

      Motivation

      Who is the affected end user?

      Any user or engineer. This kind of debugging info will be helpful to diagnose support cases where users encounter system overload errors.

      How does this affect the end user?

      Without this, users will see a generic "SystemOverload" error and will not be able to determine whether it came from the initial attempt or from a retry, or why another retry was not attempted.

      How likely is it that this problem or use case will occur?

      Common.

      If the problem does occur, what are the consequences and how severe are they?

      It delays the resolution of support cases.

      Is this issue urgent?

      Initially, I will include it as a goal for DRIVERS-3160, but it can also be completed as a follow-up change.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      No.

      Acceptance Criteria

      Overload errors returned by the driver should include actionable information that can be used to answer the questions above.

              Assignee:
              Unassigned
              Reporter:
              Shane Harvey
              Votes:
              0
              Watchers:
              2
