WiredTiger / WT-1989

Improvements to log slot freeing to improve thread scalability

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: WT2.7.0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      Investigated the negative scaling of the write-ahead log seen in SERVER-18908 and SERVER-19189. Found two issues; an experimental patch that appears to address both is attached.

      • Threads are often waiting because there are no FREE slots. Slots are freed by __log_wrlsn_server, but because that happens asynchronously there can be unnecessary delay in freeing slots, for a couple of reasons: under thread contention __log_wrlsn_server may not get scheduled; it uses yields and sleeps, so it may not notice when slots become freeable; and because the thread waiting for a FREE slot in __wt_log_slot_close also uses yields and sleeps, it may not notice right away when a slot is freed. The patch addresses this by pulling the slot-freeing logic out of the __log_wrlsn_server loop into a function __log_wrlsn, which __wt_log_slot_close calls when it has scanned all the slots without finding a FREE one (see the first sketch after this list). That call is made with log_slot_lock held for thread safety, but that is acceptable: at that point any thread that would have entered that lock would have become stuck anyway for lack of FREE slots.
      • By adding some messages to the code I noticed that often, when threads were stuck in __wt_log_slot_close waiting for a FREE slot, there were many WRITTEN slots but no FREE slots, because the oldest slot was not yet WRITTEN (either because it was waiting for I/O to complete or, more often, because it was waiting for all the threads that had joined the slot to copy their data into the buffer and transition the slot to DONE, presumably because one of those threads was held up by contention). In other words, the slots looked like this:
        SLOT: start_lsn=1000, end_lsn=2000, state<DONE (i.e. threads copying data into the slot buffer)
        SLOT: start_lsn=2000, end_lsn=3000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: start_lsn=3000, end_lsn=4000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: start_lsn=4000, end_lsn=5000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: start_lsn=5000, end_lsn=6000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        

        As I understand the algorithm, the only purpose of the WRITTEN slots is to keep track of holes in the log file (for example, 1000-2000 in the example above) so we can correctly advance the LSN - is that right? However, they aren't doing so very efficiently: the same information could be recorded by coalescing the WRITTEN slots into a single one (more precisely, one for each hole in the log file) and making the other slots FREE (see the second sketch after this list), like so:

        SLOT: start_lsn=1000, end_lsn=2000, state<DONE (i.e. threads copying data into the slot buffer)
        SLOT: start_lsn=2000, end_lsn=6000, state=WRITTEN (i.e. slot has been written to disk and is now waiting to be freed)
        SLOT: state=FREE
        SLOT: state=FREE
        SLOT: state=FREE
        
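      To make the first change concrete, here is a minimal, self-contained C model of it. Everything in it (the slot struct, the pool, the lock discipline, and the names slot_grab and log_wrlsn) is invented for illustration; only __log_wrlsn_server, __wt_log_slot_close, and __log_wrlsn are names from the actual code, and this is a sketch of the behavior, not the attached patch.

        /*
         * Minimal model of the first change, NOT the actual WiredTiger
         * code: the slot struct, pool, and helper names are hypothetical.
         */
        #include <pthread.h>
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NSLOTS 4

        enum slot_state { SLOT_FREE, SLOT_ACTIVE, SLOT_DONE, SLOT_WRITTEN };

        struct slot {
            uint64_t start_lsn, end_lsn;
            enum slot_state state;
        };

        static struct slot pool[NSLOTS];
        static uint64_t write_lsn;      /* log is durable below this LSN */
        static pthread_mutex_t log_slot_lock = PTHREAD_MUTEX_INITIALIZER;

        /*
         * log_wrlsn --
         *     Free WRITTEN slots contiguous with write_lsn.  This models
         *     the logic the patch pulls out of the __log_wrlsn_server loop
         *     into __log_wrlsn.  Caller must hold log_slot_lock.
         */
        static void
        log_wrlsn(void)
        {
            bool advanced;
            int i;

            do {
                advanced = false;
                for (i = 0; i < NSLOTS; i++)
                    if (pool[i].state == SLOT_WRITTEN &&
                        pool[i].start_lsn == write_lsn) {
                        write_lsn = pool[i].end_lsn;    /* no hole below */
                        pool[i].state = SLOT_FREE;      /* slot reusable */
                        advanced = true;
                    }
            } while (advanced);
        }

        /*
         * slot_grab --
         *     Stand-in for the FREE-slot search in __wt_log_slot_close: if
         *     the scan finds no FREE slot, run log_wrlsn() inline while
         *     still holding log_slot_lock instead of yielding and sleeping
         *     until __log_wrlsn_server gets scheduled.
         */
        static struct slot *
        slot_grab(void)
        {
            struct slot *s = NULL;
            uint64_t before;
            int i;

            pthread_mutex_lock(&log_slot_lock);
            do {
                for (i = 0; i < NSLOTS; i++)
                    if (pool[i].state == SLOT_FREE) {
                        s = &pool[i];
                        s->state = SLOT_ACTIVE;
                        break;
                    }
                before = write_lsn;
                if (s == NULL)
                    log_wrlsn();    /* free WRITTEN slots ourselves */
            } while (s == NULL && write_lsn != before); /* retry on progress */
            pthread_mutex_unlock(&log_slot_lock);
            return (s);     /* NULL: no progress possible, caller must wait */
        }

        int
        main(void)
        {
            /* One slot still filling, three written and freeable. */
            pool[0] = (struct slot){1000, 2000, SLOT_DONE};
            pool[1] = (struct slot){2000, 3000, SLOT_WRITTEN};
            pool[2] = (struct slot){3000, 4000, SLOT_WRITTEN};
            pool[3] = (struct slot){4000, 5000, SLOT_WRITTEN};
            write_lsn = 2000;

            printf("grabbed slot: %s, write_lsn now %llu\n",
                slot_grab() != NULL ? "yes" : "no",
                (unsigned long long)write_lsn);
            return (0);
        }

      With this pool, slot_grab first finds no FREE slot, frees the three contiguous WRITTEN slots inline, and succeeds on the retry without waiting for the server thread to be scheduled.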
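      The second change can be sketched the same way. The function below is hypothetical and reuses the declarations from the model above; it merges abutting WRITTEN slots, since two WRITTEN slots whose LSN ranges are contiguous carry no more information than their union:

        /*
         * Hypothetical sketch of the coalescing idea, reusing the
         * declarations from the model above; an illustration, not the
         * attached patch.  Caller must hold log_slot_lock.
         */
        static void
        log_coalesce_written(void)
        {
            int i, j;

            for (i = 0; i < NSLOTS; i++) {
                if (pool[i].state != SLOT_WRITTEN)
                    continue;
                /* Absorb any WRITTEN slot that starts where slot i ends. */
                for (j = 0; j < NSLOTS; j++)
                    if (j != i && pool[j].state == SLOT_WRITTEN &&
                        pool[j].start_lsn == pool[i].end_lsn) {
                        pool[i].end_lsn = pool[j].end_lsn;  /* extend range */
                        pool[j].state = SLOT_FREE;          /* recycle j */
                        j = -1;     /* slot i grew, rescan from the start */
                    }
            }
        }

      Run against the first diagram above, this collapses the four WRITTEN slots into a single WRITTEN slot covering 2000-6000 and turns the other three FREE, which is exactly the second diagram: one WRITTEN slot survives per hole in the log.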

      The attached patch is a POC-level implementation of the above. Some performance numbers, for n mongod client threads doing inserts of tiny documents in 10k batches into a standalone mongod server, on a machine with 12 cores (24 CPUs):

      threads     3.0.4    3.0.4 + WTlog.patch

            8    278401                 280608
           16    379076                 405451
           24    232358                 407481
           32    158440                 334523
           48    125652                 246961
           64    118095                 220157

      • Performance with a large number of threads has roughly doubled.
      • There is still some negative scaling at large thread counts, so there may be additional bottlenecks to address.

      So this seems good from a performance perspective, at least on this test. I have not done any functional testing on it. michael.cahill, sue.loverso, can you take a look and see whether this makes sense to you?

      Attachments:
        1. demo.c (14 kB)
        2. maxused.patch (2 kB)
        3. repro.sh (0.8 kB)
        4. WTlog.patch (8 kB)
        5. WTlog2.patch (10 kB)

            Assignee: Susan LoVerso (sue.loverso@mongodb.com)
            Reporter: Bruce Lucas (bruce.lucas@mongodb.com) (Inactive)
            Votes: 0
            Watchers: 16
