Priority: Major - P3
Affects Version/s: None
Improvements to log slot freeing to improve thread scalability
- Threads are often waiting because there are no FREE slots. Slots are freed by the __log_wrlsn_server. However because it is done asyncronously there may be unnecessary delay in freeing slots for a couple reason: if there is thread contention __log_wrlsn_server may not get scheduled; it uses yields and sleeps so it may not notice when slots become freeable; and because the thread waiting for a FREE slot in __wt_log_slot_close is also using yields and sleeps, it may not notice right away when a slot is freed. The patch addresses this issue by pulling the slot-freeing logic from the loop of __log_wrlsn_server out into a function __log_wrlsn which is then called from __wt_log_slot_close when it has scanned all the slots and not found a FREE one. This call is made with the log_slot_lock held for thread-safety, but that's ok because at that point any thread that would have entered that lock would have become stuck anyway due to lack of FREE slots.
- By adding some messages to the code I noticed that often when threads were stuck in __wt_log_slot_close waiting for a FREE slot there were many WRITTEN slots but no FREE slots because the oldest slot was not yet WRITTEN (either because it was waiting for i/o to complete, or actually more often was waiting for all threads that had joined the slot to copy their data into the buffer and transition the slot to DONE - presumably because one of the threads that had to do so was held up by contention.) In other words slots were like this:
As I understand the algorithm the only purpose of the WRITTEN slots is to keep track of holes in the log file (for example, 1000-2000 in the example above) so we can correctly advance the LSN - is that right? However they aren't doing so very efficiently - the same information could be recorded by coalescing the WRITTEN slots into a single one (more specifically, one for each hole in the log file), making the other slots FREE, like so:
Attached patch is a POC-level implementation of the above. Some performance numbers, for n mongod client threads doing inserts of tiny documents in 10k batches into a standalone mongod server on a machine with 12 cores (24 cpus):
- performance with a large number of threads has been about doubled
- there is still some negative scaling at large thread counts, so maybe there are additional bottlenecks to be addressed
So this seems good from a performance perspective, at least on this test. Have not done any functional testing on it. michael.cahill, sue.loverso, can you take a look and see if this makes sense to you?