[SERVER-31112] Create dedicated "slow" machine evergreen variants Created: 15/Sep/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: William Schultz (Inactive) Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Unresolved Votes: 0
Labels: stm
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Tooling & Methods
Participants:

 Description   

For teams like sharding and replication, JavaScript test reliability often depends on the hardware speed of the machines the tests run on, since multiple nodes communicate with each other. In the past, slow machines have exposed either flaky tests or actual bugs that wouldn't manifest on a high-performance machine. A dedicated set of variants with controllable levels of "slowness" could be a useful part of our test infrastructure. The slowness parameters could include disk, network, CPU, etc., with potentially separate variants for different types. This could expose our system and tests to varying types of stress that may or may not be explicitly tested currently. Tests that are known to depend on timing and machine speed could likely be excluded from such a variant.

The main goals of these "slow" variants would be the following:

1. Expose Test Flakiness: Provide stronger and more explicit verification that tests aren't "flaky". That is, tests that shouldn't be dependent on machine speed should not fail due to a machine speed issue.
2. Expose Timing-Dependent Server Bugs: Provide a more efficient and potentially reproducible way of exposing bugs in the server that only manifest under non-standard system conditions, such as an extremely slow network, disk, or CPU.

To achieve the above two goals, we would likely need to determine which of our tests are to be considered "timing-agnostic", and run only those tests on the slow variants, so that timing-dependent tests don't produce extra noise. If these "timing-agnostic" tests truly are valid tests, then they should never fail for either of the reasons named in goals 1 and 2 above. That is, they are not flaky, and there are no timing-dependent bugs that they would ever expose.
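
As a rough illustration of the selection step, the sketch below filters test files by a tag declared in a header comment. It assumes tests carry an annotation of the form "@tags: [...]" in a leading comment, and the tag name "timing_agnostic" is hypothetical; this ticket does not define how tests would actually be marked, and the function names are made up for the example.

```python
import re
from typing import Dict, List, Set

# Hypothetical tag name; the ticket does not specify how tests would be marked.
TIMING_AGNOSTIC_TAG = "timing_agnostic"

# Matches a "@tags: [a, b, c]" annotation, including lists that span several lines.
_TAGS_RE = re.compile(r"@tags:\s*\[([^\]]*)\]")


def parse_tags(source: str) -> Set[str]:
    """Return the tags declared in the first @tags annotation of a test's source."""
    match = _TAGS_RE.search(source)
    if match is None:
        return set()
    # Entries are comma-separated; strip whitespace and the '*' that prefixes
    # continuation lines inside a block comment.
    entries = (entry.strip(" \t\r\n*") for entry in match.group(1).split(","))
    return {entry for entry in entries if entry}


def select_for_slow_variant(tests: Dict[str, str]) -> List[str]:
    """Given {test name: source text}, return the names tagged timing-agnostic."""
    return sorted(
        name for name, source in tests.items()
        if TIMING_AGNOSTIC_TAG in parse_tags(source)
    )
```

Selecting by an explicit opt-in tag, rather than excluding known timing-dependent tests, would keep untagged tests off the slow variants by default, which matches the goal of avoiding extra noise.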

If these slow variants were also integrated into patch build workflows from an early stage, they could act as an extra guard against tests that may introduce flakiness or intermittent failures into the Evergreen master branch.



 Comments   
Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.

Comment by William Schultz (Inactive) [ 15/Sep/17 ]

max.hirschhorn Do the above goals seem rational to you?

Comment by Max Hirschhorn [ 15/Sep/17 ]

Scott had an idea around doing this in SERVER-21558 with controlling CPU resources. william.schultz, could you clarify what you view the goals of this new Evergreen build variant would be? Is it (a) that tests shouldn't fail even when run on a "slow" host, (b) that we identify tests which can fail when run on a "slow" host, or (c) something else?
