[DRIVERS-1811] Astrolabe Testing Improvements Created: 16/Jun/21  Updated: 15/Feb/23  Resolved: 07/Jan/22

Status: Closed
Project: Drivers
Component/s: None
Fix Version/s: None

Type: Epic Priority: Unknown
Reporter: Rachelle Palmer Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by DRIVERS-1576 Testing the efficacy of Astrolabe Closed
is duplicated by DRIVERS-1745 Atlas Planned Maintenance Testing Sta... Closed
Initiative
Driver Changes: Not Needed
Quarter: FY22Q4
Downstream Changes Summary:

NA

Detailed Project Statuses:

Engineer(s): Oleg, Jeff

Summary: Make improvements to Astrolabe testing in order to be sure we are receiving the full benefit of the tool. This project captures the effort to increase stability of Astrolabe as well as verify the efficacy of its testing such that it is accurately surfacing driver bugs.

2021-12-01:
  • No updates since Oleg returned from vacation earlier this week


2021-11-17:

  • On pause while Oleg is on vacation

2021-10-06:

  • Completed DRIVERS-1924: Retrieve server logs only when tests fail
  • Completed DRIVERS-1923: Retrieve server logs in a separate task
  • Deferred DRIVERS-1691: Research ways to avoid OutOfMemory exception when we handle huge batch of events
  • Started DRIVERS-1932: When Astrolabe run fails due to issue with Atlas, color it lavender to indicate a setup failure rather than a task failure
  • Next up: root-cause analysis (RCA) of the regularly occurring timeouts in Atlas QA. Cory will help with triage in the #astrolabe-triage Slack channel. The on-call leads will eventually be added to assist


 Description   

Summary

We need to make improvements to Astrolabe testing in order to be sure we are receiving the full benefit of the tool. This project captures the effort to increase stability of Astrolabe as well as verify the efficacy of its testing such that it is accurately surfacing driver bugs.

Efficacy Improvements

The Astrolabe project currently tests various Atlas planned maintenance scenarios in an attempt to find failures that indicate either driver bugs or bugs in the planned maintenance scenarios themselves. So far the project has not found bugs in either (aside from timeout issues in cloud-dev).

We need to test Astrolabe itself to see whether it actually achieves its goals. One way to do this would be to run a driver version that is known to have bugs related to planned maintenance, along with the workload that was used to reproduce each bug, and check whether Astrolabe reproduces those bugs. We would want to test Astrolabe this way continually to ensure that future changes to it don't obscure known driver bugs. As new driver bugs are found, the pre-bugfix version of the driver and the workload that reproduces the bug should be added to a corpus of such tests.
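
A minimal sketch of how such a corpus check could be expressed in Python. Everything here is hypothetical and for illustration only: the `KnownBugCase` fields, the ticket IDs, the workload names, and the `run` callback (which stands in for "run Astrolabe against this driver version and workload and report whether the run failed") are assumptions, not part of Astrolabe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class KnownBugCase:
    """One corpus entry: a pre-bugfix driver build plus the workload that triggers its bug."""
    ticket: str          # hypothetical ID of the driver bug
    driver_version: str  # pre-bugfix driver version
    workload: str        # name of the workload that reproduces the bug


# Hypothetical runner: returns True when the test run reports a failure.
Runner = Callable[[str, str], bool]


def find_missed_bugs(corpus: List[KnownBugCase], run: Runner) -> List[KnownBugCase]:
    """Return the corpus entries whose known bug the harness did NOT surface."""
    return [case for case in corpus if not run(case.driver_version, case.workload)]


# Illustrative corpus and a fake runner that only detects one of the two bugs.
corpus = [
    KnownBugCase("EXAMPLE-1", "1.0.0", "retry_reads"),
    KnownBugCase("EXAMPLE-2", "2.3.1", "failover_writes"),
]


def fake_run(driver_version: str, workload: str) -> bool:
    return workload == "retry_reads"


missed = find_missed_bugs(corpus, fake_run)
```

A non-empty `missed` list would mean a change to the harness has hidden a bug it should still be able to find.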

Stability Improvements

The Atlas Planned Maintenance test suite should only be red due to a driver, server, or Atlas bug, not because of an Astrolabe bug or cloud-dev/cloud-qa instability. The Atlas Planned Maintenance tests have a number of stability issues that obscure the problems the test suite was intended to find, reducing or eliminating its value:

  • Tests often time out waiting for maintenance to complete
  • Tests often time out attempting to download logs
  • Tests sometimes OOM tracking APM events
  • etc.
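
As an illustration of the OOM item above, one possible mitigation (an assumption, not a description of how Astrolabe tracks events) is to keep running counts for every event while retaining only a bounded window of the most recent events. The class and event names below are hypothetical.

```python
from collections import Counter, deque


class BoundedEventTracker:
    """Keeps per-type totals for all events but stores only the newest few in full."""

    def __init__(self, max_recent: int = 1000) -> None:
        self.counts = Counter()                 # totals per event type; small and constant-size
        self.recent = deque(maxlen=max_recent)  # oldest entries are dropped automatically

    def record(self, event_type: str, detail: object) -> None:
        self.counts[event_type] += 1
        self.recent.append((event_type, detail))


# Illustrative use: ten events, but only the last three are retained in full.
tracker = BoundedEventTracker(max_recent=3)
for i in range(10):
    tracker.record("command_failed" if i % 5 == 0 else "command_succeeded", i)
```

Memory use is then bounded by `max_recent` regardless of how many events a long maintenance run produces, at the cost of losing detail for older events.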

A policy for triage of testing failures

  • Automatically notify language teams when Astrolabe failures occur for their driver
  • Define a process for handing off triage to other teams (server, cloud, etc.) when triage determines there is no driver bug
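
The two policy points above could be sketched as a small routing function. This is only an illustration under stated assumptions: the cause categories, team names, and the `route_failure` helper are hypothetical and do not reflect an existing notification system.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    DRIVER = "driver"
    SERVER = "server"
    ATLAS = "atlas"
    HARNESS = "harness"  # Astrolabe itself or test-environment instability


# Hypothetical hand-off targets used once triage rules out a driver bug.
HANDOFF = {
    Cause.SERVER: "server-team",
    Cause.ATLAS: "cloud-team",
    Cause.HARNESS: "astrolabe-maintainers",
}


def route_failure(driver: str, cause: Optional[Cause] = None) -> str:
    """Pick who is notified about a failed run.

    Untriaged failures (cause is None) go to the language team for the driver;
    after triage, non-driver causes are handed off to the owning team.
    """
    if cause is None or cause is Cause.DRIVER:
        return f"{driver}-driver-team"
    return HANDOFF[cause]


first_owner = route_failure("python")             # new, untriaged failure
after_triage = route_failure("go", Cause.ATLAS)   # triage found an Atlas-side cause
```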

Motivation

Who is the affected end user?

No end users will be affected by this work, as it is internal testing. Driver engineers are essentially the end users here, and they may need to change their Atlas testing to accommodate updates that come out of this project.

Is this issue urgent?

This ticket is urgent because we must have a strong functional validation mechanism between drivers and Atlas. This only becomes more urgent over time as Atlas functionality expands and usage continues to grow.

Is this ticket required by a downstream team?

Not functionally, but our testing with Astrolabe is an essential verification mechanism between drivers and Atlas, so it is implicitly required cross-org.

Is this ticket only for tests?

Sort of. This project is solely about testing, but it involves making changes to Astrolabe and is more significant in urgency and scope than simply adding tests.

Cast of Characters

Engineering Lead:
Document Author:
POCers:
Product Owner:
Program Manager:
Stakeholders:

Channels & Docs

Running List of Astrolabe Issues

Slack Channel

[Scope Document|some.url]

[Technical Design Document|some.url]



 Comments   
Comment by Bernie Hackett [ 09/Jul/21 ]

This is an epic that was intended to encompass all of the Astrolabe projects in the backlog. I think we just haven't done the proper cleanup yet.

Comment by Kaitlin Mahar [ 09/Jul/21 ]

What is this ticket about? Is this related to (or is there a dependency on) DRIVERS-1576?

Generated at Thu Feb 08 08:24:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.