WiredTiger / WT-1989

Improvements to log slot freeing to improve thread scalability

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.7.0
    • Labels:
      None

      Description

      Investigated the negative scaling of the write-ahead log seen in SERVER-18908 and SERVER-19189. Found two issues; an experimental patch that appears to address both is attached.

      • Threads are often waiting because there are no FREE slots. Slots are freed by __log_wrlsn_server. However, because this is done asynchronously, there may be unnecessary delay in freeing slots, for a few reasons: if there is thread contention, __log_wrlsn_server may not get scheduled; it uses yields and sleeps, so it may not notice when slots become freeable; and because the thread waiting for a FREE slot in __wt_log_slot_close is also using yields and sleeps, it may not notice right away when a slot is freed. The patch addresses this by pulling the slot-freeing logic out of the loop of __log_wrlsn_server into a function, __log_wrlsn, which is then called from __wt_log_slot_close when it has scanned all the slots and not found a FREE one. This call is made with the log_slot_lock held for thread safety, but that's OK because at that point any thread that would have taken that lock would have become stuck anyway due to the lack of FREE slots.
      • By adding some messages to the code I noticed that often, when threads were stuck in __wt_log_slot_close waiting for a FREE slot, there were many WRITTEN slots but no FREE slots, because the oldest slot was not yet WRITTEN. It was either waiting for I/O to complete or, more often, waiting for all threads that had joined the slot to copy their data into the buffer and transition the slot to DONE, presumably because one of those threads was held up by contention. In other words, the slots looked like this:

        SLOT: start_lsn=1000, end_lsn=2000, state<DONE (i.e. threads copying data into the slot buffer)
        SLOT: start_lsn=2000, end_lsn=3000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: start_lsn=3000, end_lsn=4000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: start_lsn=4000, end_lsn=5000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: start_lsn=5000, end_lsn=6000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        

        As I understand the algorithm, the only purpose of the WRITTEN slots is to keep track of holes in the log file (for example, 1000-2000 in the example above) so we can correctly advance the LSN - is that right? However, they aren't doing so very efficiently: the same information could be recorded by coalescing adjacent WRITTEN slots into a single one (more specifically, one for each hole in the log file) and making the other slots FREE, like so:

        SLOT: start_lsn=1000, end_lsn=2000, state<DONE (i.e. threads copying data into the slot buffer)
        SLOT: start_lsn=2000, end_lsn=6000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: state=FREE
        SLOT: state=FREE
        SLOT: state=FREE
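
A minimal sketch of the coalescing step, to make the before/after pictures concrete. This is not the code in WTlog.patch: the `slot` struct, the `slot_state` enum, and `coalesce_written` are hypothetical names invented for illustration, and `SLOT_FILLING` stands in for the "state<DONE" case. It assumes the caller already holds whatever lock protects the slot array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot states; SLOT_FILLING models "state<DONE" above. */
typedef enum { SLOT_FREE, SLOT_FILLING, SLOT_DONE, SLOT_WRITTEN } slot_state;

typedef struct {
    slot_state state;
    uint64_t start_lsn; /* first LSN covered by the slot */
    uint64_t end_lsn;   /* LSN at which the next slot starts */
} slot;

/*
 * Merge WRITTEN slots whose LSN ranges touch, so that each contiguous
 * written range is tracked by a single slot and the rest become FREE.
 * Returns the number of slots freed.
 */
size_t
coalesce_written(slot *slots, size_t n)
{
    size_t freed = 0;
    int merged;

    do {
        merged = 0;
        for (size_t i = 0; i < n; i++) {
            if (slots[i].state != SLOT_WRITTEN)
                continue;
            for (size_t j = 0; j < n; j++) {
                if (j == i || slots[j].state != SLOT_WRITTEN)
                    continue;
                if (slots[j].start_lsn != slots[i].end_lsn)
                    continue;
                /* Slot j starts where slot i ends: absorb it into i. */
                slots[i].end_lsn = slots[j].end_lsn;
                slots[j].state = SLOT_FREE;
                freed++;
                merged = 1;
            }
        }
    } while (merged); /* repeat in case slots were not in LSN order */
    return freed;
}
```

Applied to the five slots in the first picture, this leaves the 1000-2000 slot untouched, extends the 2000-3000 slot to 2000-6000, and frees the other three, matching the second picture. The repeated outer pass is a simplicity choice for a small slot array, not a claim about how the patch iterates.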
        

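For the first issue (waiters stalled because slots are only freed asynchronously), the shape of the change can be sketched as follows. This is an illustration under assumptions, not the patch itself: `slot_pool`, `free_written_slots`, and `acquire_free_slot` are hypothetical stand-ins for the slot array, the extracted freeing function, and the scan in the slot-close path. The sketch is single-threaded and assumes the caller already holds the slot lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SLOT_FREE, SLOT_FILLING, SLOT_WRITTEN } slot_state;

typedef struct {
    slot_state state;
    uint64_t start_lsn;
    uint64_t end_lsn;
} slot;

typedef struct {
    slot *slots;
    size_t nslots;
    uint64_t write_lsn; /* assumed: log has no holes before this LSN */
} slot_pool;

/*
 * Free WRITTEN slots in LSN order. A slot is released only when it
 * starts exactly at write_lsn, i.e. there is no hole in front of it.
 * Returns the number of slots freed.
 */
size_t
free_written_slots(slot_pool *pool)
{
    size_t freed = 0;
    int progress;

    do {
        progress = 0;
        for (size_t i = 0; i < pool->nslots; i++) {
            slot *s = &pool->slots[i];
            if (s->state == SLOT_WRITTEN && s->start_lsn == pool->write_lsn) {
                pool->write_lsn = s->end_lsn;
                s->state = SLOT_FREE;
                freed++;
                progress = 1;
            }
        }
    } while (progress);
    return freed;
}

/*
 * Find a FREE slot (caller is assumed to hold the slot lock). If the
 * first scan finds none, run the freeing logic inline rather than
 * waiting for a background thread, then scan once more.
 * Returns the slot index, or -1 if nothing could be freed.
 */
int
acquire_free_slot(slot_pool *pool)
{
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < pool->nslots; i++)
            if (pool->slots[i].state == SLOT_FREE)
                return (int)i;
        if (pass == 0 && free_written_slots(pool) == 0)
            break;
    }
    return -1;
}
```

Returning -1 when nothing can be freed corresponds to the case described in the second bullet, where the oldest slot is not yet WRITTEN; in real code the caller would presumably keep waiting at that point.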
      The attached patch is a proof-of-concept implementation of the above. Some performance numbers, for n client threads doing inserts of tiny documents in 10k batches into a standalone mongod server on a machine with 12 cores (24 CPUs):

      threads    3.0.4        3.0.4 + WTlog.patch

       8        278401        280608
      16        379076        405451
      24        232358        407481
      32        158440        334523
      48        125652        246961
      64        118095        220157
      

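      • performance with a large number of threads has roughly doubled: at 24 to 64 threads the patched numbers are about 1.75x to 2.1x the unpatched ones
      • there is still some negative scaling at large thread counts (the patched numbers peak at 407481 with 24 threads and fall to 220157 with 64), so there may be additional bottlenecks to address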
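
The ratios behind these two observations can be recomputed from the table with a short helper. The figures are copied from the table above; the `result_row` struct and the function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    int threads;
    double base;    /* 3.0.4 */
    double patched; /* 3.0.4 + WTlog.patch */
} result_row;

/* Figures copied from the results table in this report. */
const result_row RESULTS[] = {
    { 8, 278401, 280608},
    {16, 379076, 405451},
    {24, 232358, 407481},
    {32, 158440, 334523},
    {48, 125652, 246961},
    {64, 118095, 220157},
};
const size_t NRESULTS = sizeof(RESULTS) / sizeof(RESULTS[0]);

/* Ratio of patched to unpatched throughput for one row. */
double
speedup(const result_row *r)
{
    return r->patched / r->base;
}

/* Print one line per thread count with the patched/unpatched ratio. */
void
print_speedups(void)
{
    for (size_t i = 0; i < NRESULTS; i++)
        printf("%2d threads: %.2fx\n", RESULTS[i].threads, speedup(&RESULTS[i]));
}
```

Calling `print_speedups()` lists the ratio for each thread count, which makes it easy to see that the gain is small at 8 and 16 threads and largest at 32.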

      So this seems good from a performance perspective, at least on this test. Have not done any functional testing on it. Michael Cahill, Sue LoVerso, can you take a look and see if this makes sense to you?

        Attachments

        1. demo.c
          14 kB
        2. maxused.patch
          2 kB
        3. repro.sh
          0.8 kB
        4. WTlog.patch
          8 kB
        5. WTlog2.patch
          10 kB

              People

              • Assignee:
                sue.loverso Sue LoVerso
                Reporter:
                bruce.lucas Bruce Lucas