If the memory-mapped pages of a file are not present in RAM, a page fault is taken to fetch their contents from disk. With concurrent requests that have little data in common, the situation gets worse: many different (random) areas take page faults at once, causing lots of disk accesses, including random seeks, which drops throughput considerably. Because this slows the completion of all executing requests, it increases the chance of another request arriving, and if that one starts executing it makes things worse still. What you end up seeing is an essentially idle CPU, the I/O subsystem at 100% capacity, and hard disks seeking their hearts out.
The consequence is that when MongoDB hits this capacity limit, performance falls off a cliff. There are several things that can be done to correct this:
- Reduce concurrency as saturation is approached, letting requests complete quickly instead of accumulating lots of slow, very long-running requests
- Under POSIX, the madvise system call can be used to hint the kernel. For example, if an index or data file is about to be read sequentially, you could call madvise with MADV_SEQUENTIAL and MADV_WILLNEED to suggest that the kernel read those pages ahead (the advice values are single constants, not OR-able flags, so this takes two calls). You can use MADV_DONTNEED on pages that won't be needed again in the near future, which helps the kernel decide which pages to evict to make space for new ones.
- You can use the mincore system call to determine whether the pages of a memory range are resident, i.e. whether touching the range would take page faults. This is probably the best test for available concurrency (ie throttle how often you proceed while the range is not fully resident)