During one 10-minute run of a heavy mixed workload, a 60-second stall was observed, apparently coinciding with the time between the end of one checkpoint and the start of the next.
- From A to B throughput drops to near 0.
- mongod log shows a handful of ops completing throughout this period, with increasing latencies suggesting that they have been waiting since A.
- page acquire time sleeping suggests most threads (about 93 out of 100) are waiting for access to pages.
- throughout this period 40 pages per second are being evicted because they exceeded the in-memory maximum.
- yet cache statistics show nothing leaving the cache and no change in cache sizes during this period.
- at the end of the period about 2500 failed evictions are reported within 1 second. This is about the same as the number of pages evicted during that period, i.e. 60 seconds * 40 pages/second = 2400. Is that a coincidence, or are the failed evictions reported at the end of the period the same evictions that were reported throughout the period?
- the 60-second stall appears to coincide with the time between the end of one checkpoint and the start of the next.
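
The comparison of the two eviction counts can be checked with a quick calculation. This is only a sketch using the approximate figures quoted above; the variable names are made up for illustration and are not actual statistic names.

```python
# Approximate figures taken from the observations above.
stall_seconds = 60                 # length of the stall
forced_evictions_per_second = 40   # pages evicted for exceeding in-memory maximum
failed_evictions_at_end = 2500     # failed evictions reported within 1 second at the end

# Pages reported as evicted over the whole stall.
forced_evictions_during_stall = stall_seconds * forced_evictions_per_second

# Relative gap between the two counts.
relative_gap = (
    abs(failed_evictions_at_end - forced_evictions_during_stall)
    / failed_evictions_at_end
)

print(forced_evictions_during_stall)  # 2400
print(f"{relative_gap:.1%}")          # 4.0%
```

The two counts differ by about 4%, which is within the precision of the "about" figures above. That is consistent with, but does not prove, the idea that they are the same evictions.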