This could potentially fall under https://jira.mongodb.org/browse/PM-2005, but I wanted to create a ticket outlining my issues so that there is a record of the use case.
We use Evergreen for PR testing. Sometimes Evergreen fails to test our PRs due to internal errors. When this happens, the GitHub status says only "Evergreen error", with neither an error description nor a link to more information about the error.
We then need to ask in the Slack channel what the problem is. Sometimes we are told that the problem was already resolved by the time we asked.
I haven't been keeping records of the inquiries and responses, but it feels to me that in many cases these problems are not proactively communicated to Evergreen users. I have to inquire even to be told that the problem is resolved; often there is no notification that there was a problem or that it was repaired.
Lastly, we generally need to re-push the commits to GitHub to get them tested by Evergreen.
I would like to request the following improvements to the process described above:
1. When Evergreen sends the "Evergreen error" status to GitHub, include context about what the error is. I understand that at some level the errors may not be exposable due to, e.g., security concerns, but at least the subsystem that failed can surely be identified. For example, "build could not be scheduled", "PR information could not be retrieved from github", etc.
2. The "Evergreen error" status should link somewhere. It could link to the specific failure. Alternatively, it could link to a general "Evergreen status" page, which would be updated once the problem is identified and again once it is fixed. Essentially, by clicking the "Evergreen error" status I should be able to figure out whether a fix likely to address this error has already been deployed.
3. There should be a general "Evergreen status" page where all problems with Evergreen are posted. I understand that currently there is a banner that can be turned on which is used for some problems; the status page, in my opinion, should include all problems as soon as an Evergreen team member is notified about a potential problem and begins investigating (for example, using language similar to "We are investigating reports of higher than usual error rates in ..."). The information on this page should also be available via the API.
4. When a problem is identified or corrected, or an investigation concludes, the status page should be updated accordingly. When a deploy is performed to correct a problem, the update should be posted at the end of the deploy, once the problem is completely resolved.
5. Whenever a problem that caused an "Evergreen error" GitHub status is resolved, Evergreen should automatically restart recent failed builds (say, less than 6 or 12 hours old). If the failed builds are not recorded in Evergreen's database, then upon problem resolution all Evergreen projects that have PR testing enabled should be crawled for PRs from that recent window whose status is set to "Evergreen error", and those PRs should be retested automatically without users having to re-push commits.
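
To make items 1, 2 and 5 a bit more concrete, here is a rough Python sketch of what I have in mind. It is not based on Evergreen's actual code: the subsystem keys, the "evergreen" context string, the PullRequest record, the function names and the 12-hour default window are all assumptions made up for illustration. The only parts taken from GitHub are the commit status fields `state`, `context`, `description` and `target_url`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical subsystem keys. The two messages are the examples from item 1.
ERROR_DESCRIPTIONS = {
    "scheduling": "build could not be scheduled",
    "github_fetch": "PR information could not be retrieved from github",
}


def build_error_status(subsystem: str, status_page_url: str) -> dict:
    """Build a commit status payload that carries error context and a link."""
    detail = ERROR_DESCRIPTIONS.get(subsystem, "internal error")
    return {
        "state": "error",
        "context": "evergreen",  # assumed context name
        "description": f"Evergreen error: {detail}",  # item 1
        "target_url": status_page_url,  # item 2
    }


@dataclass
class PullRequest:
    """Hypothetical record of a PR and its latest Evergreen status."""
    number: int
    status_description: str
    status_updated_at: datetime


def select_prs_to_retest(prs, resolved_at, window=timedelta(hours=12)):
    """Return PRs whose "Evergreen error" status is younger than the window (item 5)."""
    cutoff = resolved_at - window
    return [
        pr
        for pr in prs
        if pr.status_description.startswith("Evergreen error")
        and cutoff < pr.status_updated_at <= resolved_at
    ]
```

With something like this, a PR that got an "Evergreen error" status two hours before the fix would be picked up for an automatic retest, while one that failed 30 hours earlier, or one that passed, would be left alone. Matching on the "Evergreen error" prefix keeps the selection working whether or not the extra context from item 1 is present in the description.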