My test ran against a standalone instance; everything else is as described above (12-core / 24-CPU machine, 64 GB of memory, 24 workload threads). Any difference in throughput from my tests is presumably due to some difference in configuration or test parameters.
In particular, my test was entirely CPU-bound: the 20 GB capped collection fits in the default 32 GB cache. You can also see this in the stack traces I attached: once deletions begin, the 24 threads spend much of their time in pthread_mutex_timedlock, called from cappedDeleteAsNeeded, waiting for a single thread to do the deletions. That single thread is clearly CPU-bound - it spends almost all its time in __wt_tree_walk, and much of that time in __wt_evict, with almost no I/O anywhere in those stack traces. So it seems that eliminating the evictions, if that is possible, would improve throughput.
I also wonder whether things could be improved by allowing more than one thread to do the deletions - for example, n threads each deleting every nth record of a range, to better match the n threads doing insertion. How much contention would limit the speedup remains to be determined.
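
To make the "every nth record" idea concrete, here is a rough Python sketch of the partitioning only. It is not MongoDB or WiredTiger code: the names stripe, striped_delete and delete_one are hypothetical, the record ids are assumed to be available as an ordered list, and contention between the deleters is not modeled at all.

```python
import threading


def stripe(record_ids, k, n):
    """Return the records assigned to deleter k of n: every nth id, starting at offset k."""
    return record_ids[k::n]


def striped_delete(record_ids, n, delete_one):
    """Delete all of record_ids using n threads, one stripe per thread.

    delete_one is a caller-supplied (hypothetical) callback that removes a
    single record; it must be safe to call from several threads at once.
    """
    def worker(k):
        for rid in stripe(record_ids, k, n):
            delete_one(rid)

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # return only after every stripe has been deleted
```

Each record falls in exactly one stripe, so the deleters never compete for the same record; whether they still contend on shared pages or on the eviction work is exactly the open question above.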