- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Build
- 151
We need a failover option for when EngFlow has issues.
- We discussed the option of trying a local build when remote execution fails. This seems like the best option, even though it has drawbacks.
- We need to investigate whether we can differentiate between a build correctness failure (e.g. a compiler error) and a remote infra issue.
- We don't believe we can distinguish the two, so we'll probably need to always retry locally whenever anything fails. The drawbacks are longer delays on genuine build failures and potentially more confusing error logs, but it is probably better to err on the side of availability.
- This fallback could cause large slowdowns, since local execution is going to be very slow. There isn't really a way around this other than standing up our own backup remote cache system, which is likely prohibitively expensive (although maybe we can do it for critical variants?).
- We need to make sure we have active alerting set up to notify us when there is a discrepancy between local and remote build success. We don't want remote builds to fail silently and have everything slow down without us knowing about it.
This ticket covers the work for always retrying a Bazel invocation in local mode when the first remote invocation fails, and for adding a loud alert mechanism for the case where a remote build fails but the local build succeeds.
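
A rough sketch of the retry-and-alert flow described above, in Python. This is only an illustration under assumptions, not the planned implementation: the function names, the `run` and `alert` hooks, and the `--config=remote` / `--config=local` names in the comment are all hypothetical, and a real alert would go to whatever paging or metrics system we choose rather than stderr.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command, streaming its output, and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def default_alert(message: str) -> None:
    """Placeholder alert hook (assumption): just prints to stderr."""
    print(f"ALERT: {message}", file=sys.stderr)


def build_with_local_fallback(
    remote_cmd: Sequence[str],
    local_cmd: Sequence[str],
    run: Callable[[Sequence[str]], int] = run_command,
    alert: Callable[[str], None] = default_alert,
) -> int:
    """Try the remote invocation first; on any failure, retry locally.

    We do not try to tell a compiler error from a remote infra issue:
    every non-zero exit from the remote invocation triggers the retry.
    """
    remote_rc = run(remote_cmd)
    if remote_rc == 0:
        return 0

    local_rc = run(local_cmd)
    if local_rc == 0:
        # Remote failed but local succeeded: likely a remote infra
        # problem, so make noise instead of silently slowing down.
        alert(
            f"remote build failed (exit {remote_rc}) "
            "but the local retry succeeded"
        )
    # If both failed, this is most likely a genuine build failure.
    return local_rc


# Hypothetical usage; the config names below are assumptions and would
# need to match whatever is defined in our .bazelrc:
#   build_with_local_fallback(
#       ["bazel", "build", "--config=remote", "//..."],
#       ["bazel", "build", "--config=local", "//..."],
#   )
```

Returning the local exit code keeps a genuine build failure visible to the caller, while the alert fires only on a remote/local discrepancy, which is the case we most need to hear about.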