Priority: Critical - P2
Affects Version/s: 3.2.9
Fix Version/s: None
Note: this may be the same underlying issue as SERVER-25974, but some of the metrics appear to be different, and this ticket has a specific simple repro not clearly tied to the customer issue seen on SERVER-25974, so opening as a separate ticket for now until/unless we can demonstrate that they are the same issue.
The insert repro workload from SERVER-20306, also attached to this ticket as repro-32-insert.sh, gets stuck with cache utilization at about 96%:
- problems start at A, pretty much completely stuck at B
- ftdc stalls ("ftdc samples/s") suggest that application threads are sometimes getting stuck for extended periods doing evictions
- application threads seem to be starved for work to do:
  - "pages walked for eviction" has gone up but "pages seen by eviction walks" has gone down
  - application threads are often finding the queue empty
  - pages evicted by application threads is not high
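For anyone wanting to check the same figures outside the ftdc charts, here is a minimal sketch of how cache utilization and counter growth could be computed from two snapshots of the WiredTiger cache statistics. It assumes the snapshots are plain dicts taken from the wiredTiger.cache section of serverStatus and that the statistic names match the ones below; the helper names and the sample numbers are made up for illustration and are not taken from the attached data.

```python
# Statistic names assumed to match serverStatus().wiredTiger.cache
BYTES_IN_CACHE = "bytes currently in the cache"
MAX_BYTES = "maximum bytes configured"
WALKED = "pages walked for eviction"
APP_EVICTED = "pages evicted by application threads"


def cache_fill_ratio(cache):
    """Return cache utilization as a fraction of the configured maximum."""
    return cache[BYTES_IN_CACHE] / cache[MAX_BYTES]


def counter_deltas(before, after, names):
    """Return how much each cumulative counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in names}


if __name__ == "__main__":
    # Hypothetical snapshots; the values are illustrative only.
    before = {BYTES_IN_CACHE: 900, MAX_BYTES: 1000, WALKED: 5000, APP_EVICTED: 40}
    after = {BYTES_IN_CACHE: 960, MAX_BYTES: 1000, WALKED: 9000, APP_EVICTED: 42}

    print("cache fill: {:.0%}".format(cache_fill_ratio(after)))
    for name, delta in counter_deltas(before, after, [WALKED, APP_EVICTED]).items():
        print("{}: +{}".format(name, delta))
```

A large increase in pages walked combined with a small increase in pages evicted by application threads would be consistent with the starvation pattern described above, though it does not by itself identify the cause.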
I've also attached stack traces captured during the stuck period, although I don't think they give much insight.