Drivers / DRIVERS-2748

Limit max simultaneous Astrolabe Atlas (cloud-qa) clusters

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Component/s: Astrolabe, Atlas Testing

      Summary

      Astrolabe creates Atlas clusters in the "cloud-qa" environment to test driver behavior during Atlas planned maintenance tasks. The "cloud-qa" environment caps the number of simultaneous clusters we can create (max of 25; TODO: Figure out the real number?). If that cap is exceeded, Atlas either returns an API error or takes a very long time (> 1 hour) to create clusters. To avoid hitting the cap, we need to limit how many cluster-creating Astrolabe Evergreen tasks can run in parallel.

      Previously, we used the Evergreen distro "ubuntu1804-drivers-atlas-testing", which was specifically configured to allow a max of 25 hosts (TODO: Figure out the real number). However, there are two problems with that approach:

      1. That distro seems to have silently lost the important "max hosts" configuration at some point. That could happen again.
      2. We don't want to create special Evergreen distros every time we need to change to a different base distro.

      Instead, we can use one of the following approaches:

      1. Tag sets of tasks and make each set dependent on another set. For example, label all of the Atlas tasks per driver language (e.g. "atlas-go", "atlas-java", "atlas-python", etc.) and make each language group run after the previous language group (e.g. "atlas-go" depends on "atlas-java", "atlas-java" depends on "atlas-python", etc.). That would effectively limit the parallelism to ~14 (which is how many Atlas tasks there are per driver).
        See an example of how to create tag dependencies here.
      2. Put all Atlas tasks in an Evergreen task_group. Configure the task_group with max_hosts to limit parallelism (e.g. set it to 25). Note that a task_group changes the task run environment slightly, so it could cause unexpected problems.
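
      The effect of the two options on peak cluster count can be sketched in a few lines of Python. This is only an illustrative model, not Evergreen configuration: the function names are hypothetical, the per-driver count of 14 is the approximate figure quoted above, and the model assumes that (a) a dependent tag group starts only after every task in the group it depends on has finished, and (b) each host runs one task at a time.

```python
from typing import Dict, List, Optional


def chain_tag_dependencies(tags: List[str]) -> Dict[str, Optional[str]]:
    """Map each tag to the tag it depends on (hypothetical helper).

    Mirrors the example in the ticket: the first tag depends on the
    second, the second on the third, and the last tag has no dependency.
    """
    deps: Dict[str, Optional[str]] = {}
    for i, tag in enumerate(tags):
        deps[tag] = tags[i + 1] if i + 1 < len(tags) else None
    return deps


def peak_with_tag_chain(tasks_per_tag: Dict[str, int]) -> int:
    """Peak parallel tasks under option 1.

    Assumes a strict chain, so only one tag group runs at a time and
    the peak equals the size of the largest group.
    """
    return max(tasks_per_tag.values())


def peak_with_max_hosts(total_tasks: int, max_hosts: int) -> int:
    """Peak parallel tasks under option 2.

    Assumes one task per host at a time, so the peak is bounded by
    max_hosts (or by the task count, if that is smaller).
    """
    return min(total_tasks, max_hosts)


if __name__ == "__main__":
    tags = ["atlas-go", "atlas-java", "atlas-python"]
    print(chain_tag_dependencies(tags))
    # Approximate per-driver task count taken from the ticket text.
    print(peak_with_tag_chain({tag: 14 for tag in tags}))   # 14
    print(peak_with_max_hosts(total_tasks=3 * 14, max_hosts=25))  # 25
```

      Under these assumptions both options stay at or below the 25-cluster figure, but option 1 serializes the driver groups, so total build time grows with the number of drivers.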

      Motivation

      Who is the affected end user?

      Who are the stakeholders?

      How does this affect the end user?

      Are they blocked? Are they annoyed? Are they confused?

      How likely is it that this problem or use case will occur?

      Main path? Edge case?

      If the problem does occur, what are the consequences and how severe are they?

      Minor annoyance at a log message? Performance concern? Outage/unavailability? Failover can't complete?

      Is this issue urgent?

      Does this ticket have a required timeline? What is it?

      Is this ticket required by a downstream team?

      Needed by e.g. Atlas, Shell, Compass?

      Is this ticket only for tests?

      Does this ticket have any functional impact, or is it just test improvements?

      Acceptance Criteria

      • Astrolabe Evergreen builds limit the maximum simultaneous number of Atlas "cloud-qa" clusters to 25 or fewer.
      • Astrolabe Evergreen builds still complete in a reasonable period of time (< 4 hours).
      • Ideally, only "atlas-" tasks should be limited; "kind-" tasks should not be impacted.

            Assignee:
            steve.silvester@mongodb.com Steve Silvester
            Reporter:
            matt.dale@mongodb.com Matt Dale