Details

    • Type: Task
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.5.2
    • Labels:

      Description

      I noticed today that the Jenkins job for medium-lsm-compact was hung. Presumably it is a deadlock around the schema and dhandle locks. The build used by this job does not include line numbers, so the stacks show only function names. Here are the stacks of the waiting threads:

      Thread 10 (Thread 0x7f4904bfe700 (LWP 17853)):
      #0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
      #2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x0000000000481ee4 in __wt_conn_dhandle_discard_single ()
      #4  0x00000000004117cc in __sweep_server ()
      #5  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
      #6  0x00007f4905e93b2d in clone () from /lib64/libc.so.6

      Thread 7 (Thread 0x7f49033fb700 (LWP 17856)):
      #0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
      #2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x00000000004211af in __lsm_worker_manager ()
      #4  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
      #5  0x00007f4905e93b2d in clone () from /lib64/libc.so.6

      Thread 6 (Thread 0x7f4902bfa700 (LWP 17857)):
      #0  0x00007f4905e7cb97 in sched_yield () from /lib64/libc.so.6
      #1  0x0000000000480bd7 in __conn_dhandle_open_lock ()
      #2  0x00000000004815b7 in __wt_conn_btree_get ()
      #3  0x000000000044b492 in __wt_session_get_btree ()
      #4  0x0000000000481d8f in __wt_conn_dhandle_close_all ()
      #5  0x00000000004419bf in __wt_schema_drop ()
      #6  0x000000000049fa1f in __lsm_drop_file ()
      #7  0x00000000004a056d in __wt_lsm_free_chunks ()
      #8  0x00000000004255b9 in __lsm_worker ()
      #9  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
      #10 0x00007f4905e93b2d in clone () from /lib64/libc.so.6

      Thread 5 (Thread 0x7f4901bff700 (LWP 17858)):
      #0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
      #2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x000000000044b3a3 in __wt_session_get_btree ()
      #4  0x000000000044b6d2 in __wt_session_get_btree_ckpt ()
      #5  0x000000000048b13f in __wt_curfile_open ()
      #6  0x00000000004498d0 in __wt_open_cursor ()
      #7  0x0000000000449b25 in __session_open_cursor ()
      #8  0x00000000004b04e4 in __wt_bloom_finalize ()
      #9  0x000000000049ffbf in __wt_lsm_work_bloom ()
      #10 0x00000000004255f5 in __lsm_worker ()
      #11 0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
      #12 0x00007f4905e93b2d in clone () from /lib64/libc.so.6

      Thread 3 (Thread 0x7f4900bfd700 (LWP 17860)):
      #0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
      #2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x0000000000448706 in __session_create ()
      #4  0x00000000004b04bb in __wt_bloom_finalize ()
      #5  0x000000000049e5d2 in __wt_lsm_merge ()
      #6  0x00000000004254d7 in __lsm_worker ()
      #7  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
      #8  0x00007f4905e93b2d in clone () from /lib64/libc.so.6
      

      I will try to reproduce this on the AWS HDD machine.

        Issue Links

          Activity

          sueloverso Sue Loverso added a comment -

          Alex Gorrod and [~michaelcahill], this is probably related to the sweep server locking changes from yesterday, such as 38a208966af2f2ba1a34. It happened during the compact portion of the test.

          sueloverso Sue Loverso added a comment -

          Actually I'm not 100% sure what the state of the branch is. I think that changeset was reverted and the tree is back to its original state. It looks like this may be another instance of whatever hang [~michaelcahill] saw in automated testing, but there is no issue number referenced.

          michael.cahill Michael Cahill added a comment -

          The current pull request for this is WT-1811.

          That said, we know that change causes other problems.

          This is a pretty simple lock-ordering issue: sweep holds an exclusive handle lock and waits on the handle list lock, while the normal order in other threads is to acquire the handle list lock first and then the handle lock.
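
          For illustration only, here is a minimal, self-contained sketch of that inversion using plain pthread mutexes. The names are hypothetical stand-ins, not the actual WiredTiger locks or code:

          #include <pthread.h>

          /* Stand-ins for the two locks involved; not the WiredTiger types. */
          static pthread_mutex_t handle_list_lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_mutex_t handle_lock = PTHREAD_MUTEX_INITIALIZER;

          /* Most threads: handle list lock first, then the handle lock. */
          static void
          normal_order(void)
          {
              pthread_mutex_lock(&handle_list_lock);
              pthread_mutex_lock(&handle_lock);
              /* ... work ... */
              pthread_mutex_unlock(&handle_lock);
              pthread_mutex_unlock(&handle_list_lock);
          }

          /* Sweep (as in the hang above): handle lock first, then the list lock. */
          static void
          sweep_order(void)
          {
              pthread_mutex_lock(&handle_lock);
              pthread_mutex_lock(&handle_list_lock);    /* can deadlock against normal_order() */
              /* ... work ... */
              pthread_mutex_unlock(&handle_list_lock);
              pthread_mutex_unlock(&handle_lock);
          }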

          The thought I had overnight about this was to split sweep into two passes: the first would go through the handles without the handle list lock (which is safe because sweep is the only thread that removes handles from the list). In the first pass, we would just be looking for trees to close and checking if there are any closed handles in the list.

          If we see closed handles in the first pass, do a second pass holding the list lock that removes closed handles from the list.

          This way, sweep should never try to acquire the handle list lock while it is holding a handle lock. I'll code that up today and open a new pull request.
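
          A rough sketch of that two-pass structure, again with hypothetical names (sweep_handle_t, sweep_conn_t, and so on) rather than the actual patch, purely to show the intended lock ordering:

          #include <pthread.h>
          #include <stdbool.h>
          #include <stddef.h>

          typedef struct sweep_handle {
              struct sweep_handle *next;
              bool open;                      /* tree currently open */
              bool idle;                      /* safe to close */
              pthread_mutex_t handle_lock;    /* per-handle lock */
          } sweep_handle_t;

          typedef struct {
              sweep_handle_t *handles;            /* singly linked handle list */
              pthread_mutex_t handle_list_lock;
          } sweep_conn_t;

          static void
          sweep_pass(sweep_conn_t *conn)
          {
              sweep_handle_t *h, **prev;
              bool saw_closed = false;

              /*
               * Pass 1: walk the list without the handle list lock (safe because
               * sweep is the only thread that removes entries).  Close idle trees,
               * taking only the per-handle lock, and note any closed handles.
               */
              for (h = conn->handles; h != NULL; h = h->next) {
                  pthread_mutex_lock(&h->handle_lock);
                  if (h->open && h->idle)
                      h->open = false;        /* stand-in for closing the tree */
                  if (!h->open)
                      saw_closed = true;
                  pthread_mutex_unlock(&h->handle_lock);
              }

              if (!saw_closed)
                  return;

              /*
               * Pass 2: take the handle list lock and unlink closed handles.  No
               * per-handle lock is held here, so the list lock is never acquired
               * while a handle lock is held.
               */
              pthread_mutex_lock(&conn->handle_list_lock);
              for (prev = &conn->handles; (h = *prev) != NULL;) {
                  if (!h->open)
                      *prev = h->next;        /* unlink; freeing omitted for brevity */
                  else
                      prev = &h->next;
              }
              pthread_mutex_unlock(&conn->handle_list_lock);
          }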

          ramon.fernandez Ramon Fernandez added a comment -

          Additional ticket information from GitHub

          This ticket was referenced in the following commits:
          269e847ad64dd12dfcadb58f84f905069e5b8dce
