Type: Epic
Resolution: Done
Priority: Unknown
Component/s: None
Summary
We need to improve Astrolabe testing to be sure we are receiving the full benefit of the tool. This project captures the effort to increase the stability of Astrolabe and to verify the efficacy of its testing, so that it accurately surfaces driver bugs.
Efficacy Improvements
The Astrolabe project currently tests various Atlas planned maintenance scenarios in an attempt to find problematic cases that indicate either driver bugs or bugs in the planned maintenance scenarios themselves. So far the project has not found bugs in either (aside from timeout issues in cloud-dev).
We need to test Astrolabe itself to see whether it actually achieves its goals. One way to do this would be to run Astrolabe against a driver version that is known to have bugs related to planned maintenance, using the workload that originally reproduced each bug, and check whether Astrolabe reproduces those bugs. We would want to test Astrolabe this way continually, to ensure that future changes to it don't obscure known driver bugs. As new driver bugs are found, the pre-bugfix version of the driver and the workload that reproduces the bug should be added to a corpus of such tests.
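The corpus idea above could be sketched roughly as follows. This is only an illustration, not Astrolabe's actual interface: `KnownBugCase`, `find_obscured_bugs`, the `run_case` callback, and the ticket, version, and workload values are all hypothetical names invented for the example. The assumption is that `run_case` runs one corpus entry through the test harness and returns `True` when the run fails, meaning the known bug was surfaced.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class KnownBugCase:
    """One corpus entry: a pre-bugfix driver version plus its reproducing workload."""
    ticket: str          # hypothetical ticket id for the known driver bug
    driver_version: str  # last driver version that still contains the bug
    workload: str        # workload that was used to reproduce the bug
    scenario: str        # planned maintenance scenario to run


def find_obscured_bugs(
    corpus: Iterable[KnownBugCase],
    run_case: Callable[[KnownBugCase], bool],
) -> List[str]:
    """Return tickets of known bugs that the harness did NOT reproduce.

    run_case is assumed to return True when the test run fails,
    i.e. when the known bug was surfaced as expected.
    """
    return [case.ticket for case in corpus if not run_case(case)]


# Usage with a stand-in runner (all values are placeholders):
corpus = [
    KnownBugCase("EXAMPLE-1", "1.0.0", "retry_reads.yml", "primary-failover"),
    KnownBugCase("EXAMPLE-2", "2.3.1", "find_loop.yml", "instance-resize"),
]
obscured = find_obscured_bugs(corpus, lambda case: case.ticket == "EXAMPLE-1")
```

An empty result would mean every known bug is still being caught. Any ticket in the list points at a change, in the harness or in the environment, that now hides a bug the harness used to surface.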
Stability Improvements
The Atlas Planned Maintenance test suite should be red only because of a driver, server, or Atlas bug, not because of an Astrolabe bug or cloud-dev/cloud-qa instability. The Atlas Planned Maintenance tests currently have a number of stability issues that obscure the problems the suite was intended to find, reducing or eliminating its value:
- Tests often time out waiting for maintenance to complete
- Tests often time out attempting to download logs
- Tests sometimes run out of memory (OOM) while tracking APM events
- Other intermittent infrastructure and harness failures
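As an illustration of the OOM item above, one possible mitigation is to keep running counts per event type and retain only a bounded window of recent events, rather than storing every event. The sketch below is an assumption about how this could be done, not a description of how Astrolabe or any driver workload executor tracks events. `BoundedEventTracker` and its parameters are hypothetical.

```python
from collections import Counter, deque
from typing import Any, Deque


class BoundedEventTracker:
    """Track APM events with memory bounded by max_recent (hypothetical helper)."""

    def __init__(self, max_recent: int = 1000) -> None:
        # Total number of events seen, per event type.
        self.counts: Counter = Counter()
        # Only the newest max_recent events are kept; older ones are dropped.
        self.recent: Deque[Any] = deque(maxlen=max_recent)

    def record(self, event_type: str, event: Any) -> None:
        """Count the event and keep it in the bounded window of recent events."""
        self.counts[event_type] += 1
        self.recent.append(event)


# Usage: ten events recorded, only the last three retained.
tracker = BoundedEventTracker(max_recent=3)
for i in range(10):
    tracker.record("commandStarted", {"request_id": i})
```

The trade-off is that full event history is no longer available after a long run. Whether counts plus a recent window are enough for triage would need to be confirmed against what the tests actually assert on.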
A policy for triage of testing failures
- Automatically notify language teams when Astrolabe failures occur for their driver
- Define a process for handing off triage to other teams (server, cloud, etc.) when triage determines there is no driver bug
Motivation
Who is the affected end user?
No external end users will be affected by this work, as it concerns internal testing. Driver engineers are effectively the end users here, and they may need to change their Atlas testing to accommodate updates that come out of this project.
Is this issue urgent?
This ticket is urgent because we must have a strong functional validation mechanism between drivers and Atlas. This only becomes more urgent over time as Atlas functionality expands and usage continues to grow.
Is this ticket required by a downstream team?
Not functionally, but our testing with Astrolabe is an essential verification mechanism between drivers and Atlas, so it is implicitly required cross-org.
Is this ticket only for tests?
Sort of. This project is solely about testing, but it involves making changes to Astrolabe itself and is more significant in urgency and scope than simply adding tests.
Cast of Characters
Engineering Lead:
Document Author:
POCers:
Product Owner:
Program Manager:
Stakeholders:
Channels & Docs
Running List of Astrolabe Issues
Slack Channel
[Scope Document|some.url]
[Technical Design Document|some.url]
is duplicated by
- DRIVERS-1576 Testing the efficacy of Astrolabe (Development Complete)
- DRIVERS-1745 Atlas Planned Maintenance Testing Stability (Development Complete)