When a large number of events queue up, Evergreen processes the same events multiple times, which wastes processing capacity. For example, during an outage today, Splunk indicates that we processed 300k events but only 95k distinct events, meaning each event was processed roughly 3 times on average. Splunk also shows that the event notifier jobs used a median batch size of 1k.
To reduce this waste, we could try changing the event notifier into a job that evaluates a single event (or a small batch of events) and applying scopes to the events that job is processing. This would eliminate the duplicated work. To handle the large number of jobs that would run, the job would need to move into a queue group. Along with that move, we should also set a sample size to reduce contention for the head of the queue.
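As a rough illustration of how per-event scopes would prevent duplicated work, here is a minimal Go sketch. It is not Amboy's or Evergreen's actual API: scopeManager, notifyJob, eventScope, dispatch, and the "event-notify." scope prefix are all hypothetical names invented for this example. The only assumption it encodes is that a job holding a scope blocks any other job declaring the same scope until it releases it.

```go
package main

import (
	"fmt"
	"sync"
)

// scopeManager is a hypothetical stand-in for a queue's scope locking:
// at most one job may hold a given scope at a time.
type scopeManager struct {
	mu   sync.Mutex
	held map[string]string // scope -> ID of the job holding it
}

func newScopeManager() *scopeManager {
	return &scopeManager{held: map[string]string{}}
}

// acquire claims all scopes for jobID, or claims none of them if any
// scope is already held by a different job.
func (m *scopeManager) acquire(jobID string, scopes []string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, s := range scopes {
		if owner, ok := m.held[s]; ok && owner != jobID {
			return false
		}
	}
	for _, s := range scopes {
		m.held[s] = jobID
	}
	return true
}

// release frees the scopes that are currently held by jobID.
func (m *scopeManager) release(jobID string, scopes []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, s := range scopes {
		if m.held[s] == jobID {
			delete(m.held, s)
		}
	}
}

// notifyJob is a hypothetical single-event notifier job.
type notifyJob struct {
	ID      string
	EventID string
}

// eventScope derives the scope string for an event (hypothetical format).
func eventScope(eventID string) string {
	return "event-notify." + eventID
}

// dispatch tries to start every job while all earlier jobs are still in
// flight. It returns the IDs of jobs that started and of jobs that were
// skipped because another job already held the scope for the same event.
func dispatch(m *scopeManager, jobs []notifyJob) (ran, skipped []string) {
	for _, j := range jobs {
		if m.acquire(j.ID, []string{eventScope(j.EventID)}) {
			ran = append(ran, j.ID)
		} else {
			skipped = append(skipped, j.ID)
		}
	}
	return ran, skipped
}

func main() {
	m := newScopeManager()
	jobs := []notifyJob{
		{ID: "job-1", EventID: "evt-a"},
		{ID: "job-2", EventID: "evt-a"}, // duplicate of evt-a
		{ID: "job-3", EventID: "evt-b"},
	}
	ran, skipped := dispatch(m, jobs)
	fmt.Println("ran:", ran)
	fmt.Println("skipped:", skipped)
}
```

In this sketch the second job for evt-a is rejected while the first is in flight. In a real queue the skipped job would presumably be retried or dropped once the first job has marked the event as processed, and that behavior would need to be confirmed against the queue implementation.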
There should also be a bit of preliminary load testing in staging to see how many events the queue group can process at a given time.