- Type: Improvement
- Resolution: Unresolved
- Priority: Unknown
- Component/s: Astrolabe, Atlas Testing
Summary
The Astrolabe Atlas API client sometimes doesn't retry when there are errors calling the API. Cases we've observed that caused a task failure include:
- The API returned an incomplete JSON blob and Astrolabe failed to parse it.
- The API returned an error message that indicated an intermittent API error, but misused the HTTP status code 400.
There is currently logic that retries API requests, but it only retries if there is an error getting a response. Instead, we should retry every request that doesn't return a parseable API message, regardless of HTTP status code. There is a risk of retrying a request that will never succeed, but Astrolabe uses static concurrency and generally doesn't generate many API requests, so the possibility of unnecessary retries is preferable to unnecessary failures.
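A rough sketch of the proposed behavior, to make the retry condition concrete. It is not taken from the Astrolabe code base: `call_with_retry`, `RetriesExhausted`, the `send` callable, and the optional `is_transient` hook are all hypothetical names, and the attempt count and backoff are placeholder values. It assumes `send` performs one API request and returns the HTTP status code with the raw response body as a string.

```python
import json
import time


class RetriesExhausted(Exception):
    """Raised when no attempt produced a usable API message (hypothetical)."""


def call_with_retry(send, max_attempts=3, backoff_seconds=1.0,
                    is_transient=lambda status, message: False,
                    sleep=time.sleep):
    """Call send() until it yields a parseable, non-transient API message.

    send: hypothetical callable returning (status_code, body_text).
    is_transient: hypothetical hook for parseable messages that still
        describe an intermittent error (e.g. one sent with HTTP 400).
    """
    last_problem = None
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = send()
            # json.JSONDecodeError subclasses ValueError, so an
            # incomplete JSON blob is handled like a transport error.
            message = json.loads(body)
        except (OSError, ValueError) as exc:
            last_problem = exc
        else:
            # The decision to retry does not look at the status code
            # unless the caller's is_transient hook chooses to.
            if not is_transient(status, message):
                return status, message
            last_problem = "transient error reported with HTTP {}".format(status)
        if attempt < max_attempts:
            sleep(backoff_seconds * attempt)  # simple linear backoff
    raise RetriesExhausted(
        "gave up after {} attempts: {}".format(max_attempts, last_problem)
    )
```

Capping the attempts bounds the cost of retrying a request that can never succeed, which is the risk noted above.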
Motivation
Who is the affected end user?
Astrolabe maintainers and DBX devs.
How does this affect the end user?
Astrolabe maintainers need to manually restart jobs, which takes up time. DBX devs have to sift through the noise of intermittent failures, which obscures the real test data.
How likely is it that this problem or use case will occur?
The cloud-qa Atlas env is intermittently unstable. The API failures tend to happen a few times a month on average.
If the problem does occur, what are the consequences and how severe are they?
Wasted time and obscured test results.
Is this issue urgent?
No.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No.
Acceptance Criteria
What specific requirements must be met to consider the design phase complete?