alexander.gorrod, something occurred to me while haribabu.kommi and I were working on the new sync code: the changes made to support checkpoint's garbage collection of the history store (identifying WT_REFs by type in all cases and reworking the WT_REF locking) are applicable to file compaction as well.
Currently, compaction has a big weakness: it walks the tree and, for every leaf page in the file, checks whether the page should be rewritten. If the page should be rewritten and isn't in cache, it's read into the cache, marked dirty and then written by normal eviction. This has two bad consequences: first, compaction spikes cache use and makes eviction less effective; second, we waste a lot of work reading a page, converting it to its in-memory representation, reconciling it and finally writing it (for example, checksumming, encryption and compression all happen on both the read and write paths).
Using the same techniques as the new checkpoint code, we could change things so pages that aren't in memory are rewritten in a single call to the block manager, avoiding all of that bad behavior, and probably improving our compression story, since we won't have to wait for eviction to write the page.
Additionally, the compaction code quits if eviction is struggling. That's a pain point that comes up every now and then: a MongoDB installation can't get compaction working because eviction is under too much stress, so compaction starts up and then fails without making progress.
I took a look at this today, and think this approach will work. I'd estimate about two days of work, plus possible testing fallout, of course.