Evergreen / EVG-16216

Batch requests to TerminateInstances



    • Type: Task
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Won't Do
    • Labels: plt


      Currently, hosts are terminated in AWS by making an individual TerminateInstances call for each host in the host termination job. When there are many hosts to terminate (on the order of thousands), the request rate limit is easily exceeded. This can happen due to any event that requires many hosts to terminate at once, such as host drawdown, which decommissions hosts in bulk.

      According to the rate limiting docs, there are two limiting factors when terminating AWS instances:

      • Request rate limits (i.e., how many TerminateInstances calls we can make).
      • Resource rate limits (i.e., how many instances can be terminated per unit time).

      Batching hosts to terminate into larger TerminateInstances calls will reduce how quickly we consume the request rate limit (a bucket of at most 200 resource-mutating requests, refilling at 2 requests/min), but we still cannot exceed the resource rate limit (at most 1,000 instances terminated at a time, refilling at 20 terminations/min). Whether or not we account for the resource rate limit when batching hosts to terminate is up for debate, but as a first step, we should at least try batching as many hosts as possible into a single TerminateInstances call.
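      As a rough sketch of the batching step, the host IDs queued for termination could be chunked so each chunk maps to one TerminateInstances call. The helper name `chunkInstanceIDs` and the per-call cap of 1,000 (taken from the resource-limit figure quoted above, not from a verified AWS maximum) are assumptions for illustration:

```go
package main

import "fmt"

// terminateBatchSize is an assumed per-call cap, based on the resource rate
// limit figure quoted in this ticket (1,000 terminations at a time).
const terminateBatchSize = 1000

// chunkInstanceIDs splits instance IDs into batches of at most size, so each
// batch can be passed to a single TerminateInstances call.
func chunkInstanceIDs(ids []string, size int) [][]string {
	var batches [][]string
	for len(ids) > size {
		batches = append(batches, ids[:size])
		ids = ids[size:]
	}
	if len(ids) > 0 {
		batches = append(batches, ids)
	}
	return batches
}

func main() {
	// 2,500 hosts become 3 TerminateInstances calls instead of 2,500.
	ids := make([]string, 2500)
	for i := range ids {
		ids[i] = fmt.Sprintf("i-%04d", i)
	}
	batches := chunkInstanceIDs(ids, terminateBatchSize)
	fmt.Println(len(batches)) // 3 batches: 1000, 1000, 500
}
```

      This keeps request-limit consumption proportional to the number of batches rather than the number of hosts.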

      One way to implement this would be to split the host termination job into two phases. The first phase does the Evergreen-side cleanup, such as resetting stranded tasks, which then sets the host to an intermediate state ("terminating") to indicate that Evergreen-side cleanup is done, but the underlying instance still needs to be terminated in the cloud. The second phase of termination would be a separate job that actually terminates the hosts in AWS in batches.
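      The two-phase split described above could be sketched as follows. The "terminating" intermediate state comes from this ticket; every type, status, and function name here is a hypothetical stand-in, and the cloud call is stubbed out rather than using the real AWS SDK:

```go
package main

import "fmt"

// Host statuses for the sketch; "terminating" marks hosts whose
// Evergreen-side cleanup is done but whose instance still exists in AWS.
const (
	statusRunning     = "running"
	statusTerminating = "terminating"
	statusTerminated  = "terminated"
)

type Host struct {
	ID     string
	Status string
}

// phaseOneCleanup is the per-host job: do Evergreen-side cleanup (e.g.
// resetting stranded tasks, elided here) and mark the host for cloud teardown.
func phaseOneCleanup(h *Host) {
	// ... reset stranded tasks, other Evergreen-side cleanup ...
	h.Status = statusTerminating
}

// phaseTwoTerminate is the separate batch job: collect every host waiting in
// the "terminating" state and hand them to one batched cloud termination call.
func phaseTwoTerminate(hosts []*Host, terminate func(ids []string)) {
	var ids []string
	var pending []*Host
	for _, h := range hosts {
		if h.Status == statusTerminating {
			ids = append(ids, h.ID)
			pending = append(pending, h)
		}
	}
	if len(ids) == 0 {
		return
	}
	terminate(ids) // stand-in for a batched TerminateInstances call
	for _, h := range pending {
		h.Status = statusTerminated
	}
}

func main() {
	hosts := []*Host{{ID: "i-1", Status: statusRunning}, {ID: "i-2", Status: statusRunning}}
	for _, h := range hosts {
		phaseOneCleanup(h)
	}
	phaseTwoTerminate(hosts, func(ids []string) {
		fmt.Println("terminating batch:", ids)
	})
}
```

      Decoupling the phases this way lets the per-host cleanup job stay cheap and idempotent, while the batch job controls how many cloud requests are actually issued.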


        Issue Links



              backlog-server-evg Backlog - Evergreen Team
              Kim Tao (kimberly.tao@mongodb.com)