When cache utilization hits 95% performance falls off a cliff, severely impacting production.
If the solution to this isn't to (gently) keep utilization from hitting 95%, then do we need to look at why threads getting involved in evictions at 95% is so impactful? Note that on in the incident on the primary that Bruce Lucas analyzed it appeared to me that the shortfall between evictions required to keep the cache steady and actual evictions was only 0.5%, yet the impact on operation rates and latencies to get application threads involved in evictions seemed far out of proportion to the shortfall that they had to make up.
If on the other hand evictions are really so fundamentally difficult that increasing eviction rate by 0.5% is hard, does it make sense to look at it from the other end, throttling application threads by the 0.5% required (in this example) to make up the shortfall by very slightly reducing rate of pages read into cache? A similar analysis of the lag incident on the secondary showed that the shortfall was about 9%, yet making up that shortfall when the cache hit 95% utilization nearly brings replication to a halt for extended periods.