Description
When testing Kubernetes Operators in multi-cloud scenarios, it happens from time to time that our tasks fail due to issues out of our control - networking, timeouts, slow Kubernetes cluster nodes etc. We'd like to request building an automatic rebuild capability to Evergreen. This would significantly increase our productivity (as currently we re-trigger those jobs manually via UI).
Within the Kubernetes Team we discussed a few implementation ideas:
- A Github hook listening on evergreen retry failed command. This scenario could be easily automated by the product teams as a Github Action for example.
- A REST Endpoint on a Patch level that would calculate and re-trigger failed tasks.
- Introduce necessary flags in evergreen.yml to make a task eligible for automatic rebuilds. This approach would help the teams to control their flaky tasks.
It is highly recommended that there's some upper-bound for the number of automatic rebuilds. This would prevent the rebuilds from getting out of control.
xref: https://mongodb.slack.com/archives/C0V896UV8/p1657782556658069