The event notifier jobs can currently process an unbounded number of events in aggregate. There is a limit, but it is a per-job limit, not a global cap on the total number of events processed simultaneously. This means that if event notifier jobs pile up behind a large backlog of unprocessed events, they can collectively process an unrestricted number of events at once. It's worth attempting to throttle the total number of events processed simultaneously, since the subscriptions aggregation used for event processing is inefficient. This will degrade the timeliness of downstream triggers such as user notifications, but may prevent the repeated drops in read tickets that the DB has been experiencing.
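As a minimal sketch of the idea, the global cap could be modeled as a shared pool of processing slots that every job must acquire from before taking an event. The names and structure below are hypothetical; in production the slot count would have to live in shared storage (e.g. an atomic counter in the DB or a cache) rather than in a single process, where a `BoundedSemaphore` suffices for illustration:

```python
import threading

# Illustrative value only; the real limit should come out of load testing.
GLOBAL_EVENT_LIMIT = 4

# Stand-in for a shared, cross-job slot pool (in-process only).
global_slots = threading.BoundedSemaphore(GLOBAL_EVENT_LIMIT)

def process_events(events, per_job_limit=2):
    """Process up to per_job_limit events, but only while global slots remain."""
    processed = []
    for event in events[:per_job_limit]:
        # Non-blocking acquire: if the global cap is already reached,
        # stop rather than adding more load on the DB.
        if not global_slots.acquire(blocking=False):
            break
        try:
            processed.append(event)  # stand-in for the real processing work
        finally:
            global_slots.release()
    return processed
```

The key design point is the non-blocking acquire: a job that finds the global pool exhausted leaves its events in the backlog for a later run instead of queueing up, so the throttle bounds concurrent DB load rather than just delaying it.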
However, putting a simple limit on the number of events is somewhat tricky, since it increases the likelihood that duplicate work will be done and the event processor will become even slower. As the backlog of unprocessed events grows, the likelihood that multiple jobs process the same event increases, which compounds the system's inefficiency. For example, these Splunk results show a spike in the number of events processed. Over the same time range, the number of distinct events processed is about 80k less than the total number of events processed (280k), i.e. roughly 200k distinct events. Zooming in to the exact window of the high event processing rate, the ratio of distinct events processed to total events processed was below 50%, meaning each event was processed at least twice on average. Therefore, the global event processing limit has to be implemented in a way that doesn't add to the duplicate work the event notifier jobs already do, which means also mitigating the underlying issue of the event notifier processing events multiple times.
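One common way to avoid the duplicate work described above is to have each job atomically "claim" an event before processing it, so that two jobs scanning the same backlog never both process the same event. This is a hypothetical sketch, not the current implementation; in production the claim would be an atomic conditional update on the events table (or an equivalent compare-and-set in a shared store), which a lock-guarded dict stands in for here:

```python
import threading

# In-process stand-in for an atomic compare-and-set claim, e.g.
# UPDATE events SET claimed_by = :job WHERE id = :id AND claimed_by IS NULL
_claims = {}
_claims_lock = threading.Lock()

def try_claim(event_id, job_id):
    """Return True iff this job won the claim for event_id."""
    with _claims_lock:
        if event_id in _claims:
            return False  # another job already claimed this event
        _claims[event_id] = job_id
        return True

def process_batch(event_ids, job_id):
    # Only process events this job successfully claimed; already-claimed
    # events are skipped instead of being processed a second time.
    return [e for e in event_ids if try_claim(e, job_id)]
```

With claiming in place, the distinct-to-total ratio from the Splunk numbers should approach 100%, which is what makes a global processing limit safe to add on top.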
Before putting this in production, it would be good to load test it in staging, both to confirm that the limit doesn't worsen the duplicate-processing slowdown described above and to determine a reasonable value for the global event processing limit.