One intent host (evg-rhel80-small-20220106162511-3162854204370114620) is making it to the part of the host termination job that describes the cloud instance, even though the check above it is supposed to exit early for intent hosts.
This host still appears to be alive, because it keeps updating its last communicated time. It seems possible that the host was marked as building failed because the host is not supplying an instance ID with its next task request. In that case, the termination job would be unable to terminate the cloud instance (due to the lack of an instance ID), but the host would remain alive.
Some small changes to the agent and host life cycle would likely improve this:
- The agent should re-attempt to get its instance ID if the initial attempt fails. It might be nice to log the error to Splunk when the agent fails to get an instance ID.
- The host termination job should handle intent host IDs with terminated status idempotently by no-oping them, since the job can't terminate the cloud instance.
- Alert at a higher level to monitor hosts that were building and successfully started an agent, but failed to send an instance ID to next task.