WiredTiger / WT-1720

Threads blocked due to cache overflow in LSM

    • Type: Task
    • Resolution: Done
    • None
    • Affects Version/s: None
    • Component/s: None

      There is a test/format LSM job that got stuck. The configuration file is:

      ############################################
      #  RUN PARAMETERS
      ############################################
      abort=0
      auto_throttle=1
      firstfit=0
      bitcnt=2
      bloom=1
      bloom_bit_count=4
      bloom_hash_count=24
      bloom_oldest=0
      cache=30
      checkpoints=1
      checksum=uncompressed
      chunk_size=1
      compaction=0
      compression=zlib
      data_extend=0
      data_source=lsm
      delete_pct=18
      dictionary=0
      evict_max=5
      file_type=row-store
      backups=0
      huffman_key=0
      huffman_value=0
      insert_pct=83
      internal_key_truncation=1
      internal_page_max=14
      isolation=read-uncommitted
      key_gap=1
      key_max=122
      key_min=10
      leak_memory=0
      leaf_page_max=11
      logging=0
      logging_archive=1
      logging_prealloc=1
      logging=0
      lsm_worker_threads=4
      merge_max=13
      mmap=1
      ops=100000
      prefix_compression=1
      prefix_compression_min=6
      repeat_data_pct=54
      reverse=0
      rows=100000
      runs=100
      split_pct=67
      statistics=0
      statistics_server=0
      threads=21
      timer=0
      value_max=3202
      value_min=15
      wiredtiger_config=
      write_pct=66
      ############################################
      

      The LSM tree has 20 active chunks. Of those chunks 5 are flushed, the rest are all in memory. The non-flushed chunks are filling the cache.

      There are 4 LSM worker threads, one of which is the manager. Of the other three, one can only do switch and drop operations (that thread is idle). Another is currently doing a merge, but is stuck waiting on a full cache:

      Thread 28 (Thread 0x7f6dfc378700 (LWP 110983)):
      #0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
          at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
      #1  0x00000000004402f6 in __wt_cond_wait (session=0x2525fe0, cond=0x25208c0,
          usecs=100000) at ../src/os_posix/os_mtx_cond.c:78
      #2  0x000000000042c031 in __wt_cache_wait (session=0x2525fe0, full=131)
          at ../src/evict/evict_lru.c:1464
      #3  0x00000000004dd1cf in __wt_cache_full_check (session=0x2525fe0)
          at ../src/include/cache.i:197
      #4  0x00000000004de510 in __cursor_enter (session=0x2525fe0)
          at ../src/include/cursor.i:63
      #5  0x00000000004de5d8 in __curfile_enter (cbt=0x7f6de4061b20)
          at ../src/include/cursor.i:96
      #6  0x00000000004de791 in __cursor_func_init (cbt=0x7f6de4061b20, reenter=0)
          at ../src/include/cursor.i:198
      #7  0x00000000004e00a9 in __wt_btcur_next (cbt=0x7f6de4061b20, truncating=0)
          at ../src/btree/bt_curnext.c:415
      #8  0x00000000004b144b in __curfile_next (cursor=0x7f6de4061b20)
          at ../src/cursor/cur_file.c:113
      #9  0x00000000004c7373 in __clsm_next (cursor=0x7f6de4183e10)
          at ../src/lsm/lsm_cursor.c:795
      #10 0x00000000004cb4cc in __wt_lsm_merge (session=0x2525fe0,
          lsm_tree=0x25034e0, id=2) at ../src/lsm/lsm_merge.c:346
      #11 0x000000000043b497 in __lsm_worker (arg=0x2517920)
          at ../src/lsm/lsm_worker.c:138
      

      The remaining worker is creating a bloom filter, and is stuck waiting for the cache to get less full:

      Thread 27 (Thread 0x7f6dfbb77700 (LWP 110984)):
      #0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
          at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
      #1  0x00000000004402f6 in __wt_cond_wait (session=0x25262e0, cond=0x25208c0,
          usecs=100000) at ../src/os_posix/os_mtx_cond.c:78
      #2  0x000000000042c031 in __wt_cache_wait (session=0x25262e0, full=131)
          at ../src/evict/evict_lru.c:1464
      #3  0x00000000004dd1cf in __wt_cache_full_check (session=0x25262e0)
          at ../src/include/cache.i:197
      #4  0x00000000004de510 in __cursor_enter (session=0x25262e0)
          at ../src/include/cursor.i:63
      #5  0x00000000004de5d8 in __curfile_enter (cbt=0x7f6da41c7800)
          at ../src/include/cursor.i:96
      #6  0x00000000004de791 in __cursor_func_init (cbt=0x7f6da41c7800, reenter=0)
          at ../src/include/cursor.i:198
      #7  0x00000000004e00a9 in __wt_btcur_next (cbt=0x7f6da41c7800, truncating=0)
          at ../src/btree/bt_curnext.c:415
      #8  0x00000000004b144b in __curfile_next (cursor=0x7f6da41c7800)
          at ../src/cursor/cur_file.c:113
      #9  0x00000000004c7373 in __clsm_next (cursor=0x7f6da4c21a20)
          at ../src/lsm/lsm_cursor.c:795
      #10 0x00000000004ce9ad in __lsm_bloom_create (session=0x25262e0,
          lsm_tree=0x25034e0, chunk=0x7f6d48003d30, chunk_off=7)
          at ../src/lsm/lsm_work_unit.c:405
      #11 0x00000000004ce124 in __wt_lsm_work_bloom (session=0x25262e0,
          lsm_tree=0x25034e0) at ../src/lsm/lsm_work_unit.c:224
      #12 0x000000000043b2cf in __lsm_worker_general_op (session=0x25262e0,
          cookie=0x2517948, completed=0x7f6dfbb76ee0) at ../src/lsm/lsm_worker.c:74
      #13 0x000000000043b3bf in __lsm_worker (arg=0x2517948)
          at ../src/lsm/lsm_worker.c:122
      

      I think creating bloom filters doesn't expect to get stuck waiting for space in the cache. In __lsm_bloom_create we set the WT_SESSION_NO_CACHE_CHECK flag when doing a post-create traversal of the bloom filter. We don't set that flag when traversing the chunk to create the bloom filter itself, even though we set WT_SESSION_NO_CACHE. Setting WT_SESSION_NO_CACHE_CHECK for that traversal as well is one option.
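      The effect of a per-session "skip the cache check" flag can be sketched in isolation. This is not WiredTiger's code: every name prefixed with sketch_ or SKETCH_ is hypothetical, the cache is reduced to two counters, and "blocking" is modelled as stopping the traversal rather than waiting on a condition variable. It only illustrates setting the flag for the duration of a traversal and restoring the previous state afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a session flag such as WT_SESSION_NO_CACHE_CHECK. */
#define SKETCH_NO_CACHE_CHECK 0x1u

/* Simplified cache accounting (assumption: two counters are enough here). */
struct sketch_cache {
    uint64_t bytes_inuse;
    uint64_t bytes_max;
};

struct sketch_session {
    uint32_t flags;
    struct sketch_cache *cache;
    unsigned blocked; /* how many times this session would have waited */
};

/* Return true if the session would have to wait for eviction. */
static bool
sketch_cache_full_check(struct sketch_session *s)
{
    if (s->flags & SKETCH_NO_CACHE_CHECK)
        return false; /* flag set: never wait on the cache */
    if (s->cache->bytes_inuse >= s->cache->bytes_max) {
        s->blocked++;
        return true;
    }
    return false;
}

/*
 * Visit up to nitems entries, checking the cache before each step the way a
 * cursor "next" call does in the traces above. Stops at the first step that
 * would block and returns the number of entries visited.
 */
static size_t
sketch_traverse(struct sketch_session *s, size_t nitems)
{
    size_t visited = 0;

    assert(s != NULL && s->cache != NULL);
    for (size_t i = 0; i < nitems; i++) {
        if (sketch_cache_full_check(s))
            break;
        visited++;
    }
    return visited;
}

/* Run the traversal with the flag set, then restore the previous flag state. */
static size_t
sketch_traverse_no_cache_check(struct sketch_session *s, size_t nitems)
{
    uint32_t was_set = s->flags & SKETCH_NO_CACHE_CHECK;
    size_t visited;

    s->flags |= SKETCH_NO_CACHE_CHECK;
    visited = sketch_traverse(s, nitems);
    if (!was_set)
        s->flags &= ~SKETCH_NO_CACHE_CHECK;
    return visited;
}
```

      With a cache that is over its limit, sketch_traverse stops before visiting anything, while sketch_traverse_no_cache_check visits every entry and leaves the session flags as it found them.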

      Alternatively we could fiddle with the LSM worker thread work unit assignments, so that the thread that only does switches and drops (very short lived operations) could do flushes as well if we've stopped making progress. The difficulty would be in determining when we are and aren't making progress.

            Assignee:
            alexander.gorrod@mongodb.com Alexander Gorrod
            Reporter:
            alexander.gorrod@mongodb.com Alexander Gorrod
            Votes:
            0
            Watchers:
            1
