Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-1230

LSM queue growing too large

    • Type: Icon: Task Task
    • Resolution: Done
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:

      When I ran the close-debug branch on the AWS HD box, it hung overnight during the populate phase (medium-multi-lsm). The appq is growing very large. I put a breakpoint in __wt_lsm_manager_pop_entry after acquiring the app spin lock and executed this gdb function. I killed it after a couple minutes while typing this up, as it hadn't finished yet, and the queue length was over 400K! [EDIT: I ran it a second time, intending to let it finish and ran out of patience and that one was over 400K, not the 200K below]

      (gdb) define qcount
      >set $c = 0
      >set $p = (&manager->appqh)->tqh_first
      >while ($p != 0)
       >set $c = $c + 1
       >set $p = ($p)->q.tqe_next
       >end
      >printf "App Q units queued: %d", $c
      >end
      (gdb) qcount
      ^CQuit
      (gdb) print $c
      $9 = 203489
      

      Two of the three LSM worker threads are waiting on the evict_waiter_cond. Here are the stacks of some interesting threads. Thread 5 is the LSM worker "stuck" in the app queue for loop. Thread 10 is the eviction thread. Threads 3 and 4 are the other LSM workers stuck in __wt_cache_full_check and Thread 2 is an application thread similarly stuck:

      Thread 10 (Thread 0x7f2e9ec72700 (LWP 30474)):
      #0  __wt_tree_walk (session=0x1ef3990, refp=0x7f2e6c068c38, flags=736)
          at ../src/btree/bt_walk.c:205
      WT-1  0x0000000000459fdd in __evict_walk_file (session=0x1ef3990, 
          slotp=0x7f2e9ec71d0c, flags=3) at ../src/btree/bt_evict.c:988
      WT-2  0x0000000000459a86 in __evict_walk (session=0x1ef3990, 
          entriesp=0x7f2e9ec71d7c, flags=3) at ../src/btree/bt_evict.c:895
      WT-3  0x0000000000459494 in __evict_lru (session=0x1ef3990, flags=3)
          at ../src/btree/bt_evict.c:722
      WT-4  0x0000000000458e97 in __evict_pass (session=0x1ef3990)
          at ../src/btree/bt_evict.c:484
      WT-5  0x0000000000458082 in __evict_server (arg=0x1ef3990)
          at ../src/btree/bt_evict.c:162
      
      Thread 5 (Thread 0x7f2e9c46d700 (LWP 30479)):
      #0  __wt_lsm_manager_pop_entry (session=0x1ef4890, type=18, 
          entryp=0x7f2e9c46ce38) at ../src/lsm/lsm_manager.c:502
      WT-1  0x0000000000427f36 in __lsm_worker_general_op (session=0x1ef4890, 
          cookie=0x1f2cf30, completed=0x7f2e9c46ce70) at ../src/lsm/lsm_worker.c:43
      WT-2  0x00000000004281dd in __lsm_worker (arg=0x1f2cf30)
          at ../src/lsm/lsm_worker.c:111
      
      Thread 4 (Thread 0x7f2e9bc6c700 (LWP 30480)):
      #0  0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
         from /lib64/libpthread.so.0
      WT-1  0x000000000042c98c in __wt_cond_wait (session=0x1ef4b10, cond=0x1f0ade0, 
          usecs=100000) at ../src/os_posix/os_mtx.c:77
      WT-2  0x00000000004aefca in __wt_cache_full_check (session=0x1ef4b10)
          at ../src/include/cache.i:166
      WT-3  0x00000000004af0ff in __cursor_enter (session=0x1ef4b10)
          at ../src/include/cursor.i:62
      WT-4  0x00000000004af1d8 in __curfile_enter (cbt=0x7f2e700ae310)
          at ../src/include/cursor.i:99
      WT-5  0x00000000004af305 in __cursor_func_init (cbt=0x7f2e700ae310, reenter=0)
          at ../src/include/cursor.i:145
      WT-6  0x00000000004b0b59 in __wt_btcur_next (cbt=0x7f2e700ae310, truncating=0)
          at ../src/btree/bt_curnext.c:399
      WT-7  0x000000000048ade1 in __curfile_next (cursor=0x7f2e700ae310)
          at ../src/cursor/cur_file.c:79
      WT-8  0x000000000049d39b in __clsm_next (cursor=0x7f2e70dc8d40)
          at ../src/lsm/lsm_cursor.c:794
      WT-9  0x00000000004a50b6 in __lsm_bloom_create (session=0x1ef4b10, 
          lsm_tree=0x1f6e2f0, chunk=0x7f2e6c066f20, chunk_off=2)
          at ../src/lsm/lsm_work_unit.c:376
      WT-10 0x00000000004a479f in __wt_lsm_work_bloom (session=0x1ef4b10, 
          lsm_tree=0x1f6e2f0) at ../src/lsm/lsm_work_unit.c:191
      WT-11 0x00000000004280cf in __lsm_worker_general_op (session=0x1ef4b10, 
          cookie=0x1f2cf50, completed=0x7f2e9bc6be70) at ../src/lsm/lsm_worker.c:69
      ---Type <return> to continue, or q <return> to quit--- 
      WT-12 0x00000000004281dd in __lsm_worker (arg=0x1f2cf50)
          at ../src/lsm/lsm_worker.c:111
      
      Thread 3 (Thread 0x7f2e9b46b700 (LWP 30481)):
      #0  0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
         from /lib64/libpthread.so.0
      WT-1  0x000000000042c98c in __wt_cond_wait (session=0x1ef4d90, cond=0x1f0ade0, 
          usecs=100000) at ../src/os_posix/os_mtx.c:77
      WT-2  0x00000000004aefca in __wt_cache_full_check (session=0x1ef4d90)
          at ../src/include/cache.i:166
      WT-3  0x00000000004af0ff in __cursor_enter (session=0x1ef4d90)
          at ../src/include/cursor.i:62
      WT-4  0x00000000004af1d8 in __curfile_enter (cbt=0x7f2e78fc1700)
          at ../src/include/cursor.i:99
      WT-5  0x00000000004af305 in __cursor_func_init (cbt=0x7f2e78fc1700, reenter=0)
          at ../src/include/cursor.i:145
      WT-6  0x00000000004b0b59 in __wt_btcur_next (cbt=0x7f2e78fc1700, truncating=0)
          at ../src/btree/bt_curnext.c:399
      WT-7  0x000000000048ade1 in __curfile_next (cursor=0x7f2e78fc1700)
          at ../src/cursor/cur_file.c:79
      WT-8  0x000000000049d39b in __clsm_next (cursor=0x7f2e78357b20)
          at ../src/lsm/lsm_cursor.c:794
      WT-9  0x00000000004a50b6 in __lsm_bloom_create (session=0x1ef4d90, 
          lsm_tree=0x1f6e2f0, chunk=0x7f2e6c01ec20, chunk_off=3)
          at ../src/lsm/lsm_work_unit.c:376
      WT-10 0x00000000004a479f in __wt_lsm_work_bloom (session=0x1ef4d90, 
          lsm_tree=0x1f6e2f0) at ../src/lsm/lsm_work_unit.c:191
      WT-11 0x00000000004280cf in __lsm_worker_general_op (session=0x1ef4d90, 
          cookie=0x1f2cf70, completed=0x7f2e9b46ae70) at ../src/lsm/lsm_worker.c:69
      WT-12 0x00000000004281dd in __lsm_worker (arg=0x1f2cf70)
          at ../src/lsm/lsm_worker.c:111
      
      Thread 2 (Thread 0x7f2e9ac6a700 (LWP 30482)):
      #0  0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
         from /lib64/libpthread.so.0
      WT-1  0x000000000042c98c in __wt_cond_wait (session=0x1ef4390, cond=0x1f0ade0, 
          usecs=100000) at ../src/os_posix/os_mtx.c:77
      WT-2  0x000000000049ae53 in __wt_cache_full_check (session=0x1ef4390)
          at ../src/include/cache.i:166
      WT-3  0x000000000049b423 in __clsm_enter (clsm=0x7f2e8c001ae0, reset=0, update=1)
          at ../src/lsm/lsm_cursor.c:164
      WT-4  0x000000000049f450 in __clsm_insert (cursor=0x7f2e8c001ae0)
          at ../src/lsm/lsm_cursor.c:1330
      WT-5  0x0000000000407195 in populate_thread (arg=0x1f54050)
          at ../../../bench/wtperf/wtperf.c:818
      

      It isn't clear to me that the queue length is the real issue, or whether eviction is not making enough progress to unplug the other two threads so that they can help with the work queue. The first time I attached to the process, the eviction server was sleeping, while all Threads 2, 3, 4 were still waiting on the condition:

      Thread 10 (Thread 0x7f2e9ec72700 (LWP 30474)):
      #0  0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
         from /lib64/libpthread.so.0
      WT-1  0x000000000042c98c in __wt_cond_wait (session=0x1ef3990, cond=0x1f0ad70, 
          usecs=100000) at ../src/os_posix/os_mtx.c:77
      WT-2  0x00000000004582ab in __evict_server (arg=0x1ef3990)
          at ../src/btree/bt_evict.c:194
      

            Assignee:
            Unassigned Unassigned
            Reporter:
            sue.loverso@mongodb.com Susan LoVerso
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: