Affects Version/s: None
Fix Version/s: WT1.4.2
Here are a few changes to your fixes for eviction during checkpoints (a branch on a branch).
The first change is mostly cosmetic: as far as I can see, there is no need for a round trip to the eviction server thread to wait for LRU eviction to complete before starting a checkpoint. Any pages selected for LRU eviction will be locked while evict_lock is held, and the checkpoint walk will wait for the locked state to be resolved before attempting to evict any parent pages.
The next thing I realized is that your check in __rec_review of whether an ancestor page is in some state other than WT_REF_MEM is sufficient on its own. That is, we don't need to completely avoid eviction of dirty pages anywhere in the file for the whole of the internal node checkpoint.
The reason is a little subtle: as mentioned above, any eviction "ahead of" the checkpoint will cause the checkpoint walk to pause until it is done. Any eviction "behind" the checkpoint will read using the checkpoint transaction snapshot, so no newer changes can be evicted anyway.
The tradeoffs with this change are:
1. application threads can help out with checkpoint in the unlikely event that an internal node ends up on the LRU queue;
2. it simplifies the code (less checking for whether writes are disabled); but
3. there could be more wasted work in eviction threads, where reconciliation starts and then gives up on pages that have no chance of being evicted successfully.
If 3. is measurable, I'm inclined to address it by tweaking which pages are put on the LRU queue, keeping the complexity out of __rec_review.
Please take a look and let's talk when I come online in your afternoon / evening. I think we're close: both versions are running hundreds of iterations of that test/format config, and the performance testing is looking pretty good. There is still a slight dip at the end of checkpoints but, as discussed previously, that could be contention on the block manager mutex.