Before expanding Cedar's participation in data collection for Evergreen, we should fully understand the consequences of increased load—the "where" and "why" of issues that arise. The current process of enabling Cedar features in different Evergreen or Evergreen-related subsystems and reverting as problems arise has been painfully slow and disruptive. Before enabling Cedar services in production, we should be as confident as possible that these services will work as expected and continue to do so under increased load.
The main objectives of this investigation are to (1) fully understand the impact of increasing specific types of load (such as logging versus test results) and (2) figure out the next steps (for example, if this is an architectural issue, the data storage format may simply not be feasible and we will need to revisit its design entirely).
Next steps (in increasing order of difficulty):
(1) Reduce database connections by caching application-wide, or "environment", values such as the application configuration and the service user's username and API key.
(2) Investigate any other areas where we can reduce database connections.
(3) Consider adding more service flags to quickly disable Cedar when something goes wrong in prod (this is mostly for resmoke, where disabling currently requires a code change in the evergreen.yml). (EVG-15616)
(4) Investigate the driver configuration. Are we doing anything silly here, like unnecessarily limiting the maximum number of connections?
(5) Investigate potential DB bottlenecks. (PRODTRIAGE-2133)
(6) Stress testing in staging. While this will be the most time-consuming step, it is arguably the most important, since it will enable us to iterate quickly and give us the proper resources to analyze and understand real-life load impacts.
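The caching idea in step (1) can be sketched as a small TTL cache that refreshes an environment value from the database only when the cached copy is stale. This is a minimal illustration, not the Evergreen/Cedar API; all names here (`EnvCache`, the `fetch` callback) are made up for the example.

```python
import time

class EnvCache:
    """Caches an application-wide value (e.g. config, service user's API key)
    so that repeated reads do not each require a database round trip.
    Illustrative sketch only; not Evergreen/Cedar code."""

    def __init__(self, fetch, ttl_seconds=60):
        self._fetch = fetch          # stand-in for the real DB lookup
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = None
        self.db_hits = 0             # instrumentation for the example

    def get(self):
        now = time.monotonic()
        if self._fetched_at is None or now - self._fetched_at > self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
            self.db_hits += 1
        return self._value

# 100 reads, but only the first one reaches the "database".
cache = EnvCache(fetch=lambda: "api-key-from-db", ttl_seconds=60)
values = [cache.get() for _ in range(100)]
print(cache.db_hits)  # 1
```

The TTL keeps the cache from serving indefinitely stale credentials or configuration; choosing its length is a trade-off between connection savings and how quickly config changes propagate.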
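For step (4), one quick sanity check is whether the connection string itself caps the pool. `maxPoolSize` and `minPoolSize` are standard MongoDB connection-string options; the URI below and the helper function are made-up examples of how such a check might look:

```python
from urllib.parse import urlparse, parse_qs

def pool_options(uri):
    """Return the connection-pool-related options present in a MongoDB URI.
    maxPoolSize/minPoolSize are standard MongoDB connection string options;
    the URIs used here are fabricated examples."""
    query = parse_qs(urlparse(uri).query)
    return {k: v[0] for k, v in query.items()
            if k.lower() in ("maxpoolsize", "minpoolsize", "maxidletimems")}

uri = "mongodb://cedar.example.net:27017/?maxPoolSize=10&minPoolSize=2"
print(pool_options(uri))  # {'maxPoolSize': '10', 'minPoolSize': '2'}
```

A surprisingly low `maxPoolSize` (or one inherited from a default in shared config) is exactly the kind of "silly" limit worth ruling out before digging into deeper bottlenecks.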