[WT-1814] Deadlock caused by sweep server Created: 24/Mar/15  Updated: 24/Apr/15  Resolved: 09/Apr/15

Status: Closed
Project: WiredTiger
Component/s: None
Affects Version/s: None
Fix Version/s: WT2.5.2

Type: Task
Reporter: Sue LoVerso Assignee: Michael Cahill
Resolution: Fixed Votes: 0
Labels: Anyone, Bug

Issue Links:
Related
related to WT-1811 Change sweep to not wait on the dhand... Closed
is related to WT-1819 Split sweep into two passes Closed

 Description   

I noticed today that the Jenkins job for medium-lsm-compact was hung. It is presumably a deadlock around the schema and dhandle locks. This build does not include line numbers. Here are the stacks of the waiting threads:

Thread 10 (Thread 0x7f4904bfe700 (LWP 17853)):
#0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
#2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000481ee4 in __wt_conn_dhandle_discard_single ()
#4  0x00000000004117cc in __sweep_server ()
#5  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f4905e93b2d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f49033fb700 (LWP 17856)):
#0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
#2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000004211af in __lsm_worker_manager ()
#4  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f4905e93b2d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f4902bfa700 (LWP 17857)):
#0  0x00007f4905e7cb97 in sched_yield () from /lib64/libc.so.6
#1  0x0000000000480bd7 in __conn_dhandle_open_lock ()
#2  0x00000000004815b7 in __wt_conn_btree_get ()
#3  0x000000000044b492 in __wt_session_get_btree ()
#4  0x0000000000481d8f in __wt_conn_dhandle_close_all ()
#5  0x00000000004419bf in __wt_schema_drop ()
#6  0x000000000049fa1f in __lsm_drop_file ()
#7  0x00000000004a056d in __wt_lsm_free_chunks ()
#8  0x00000000004255b9 in __lsm_worker ()
#9  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f4905e93b2d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f4901bff700 (LWP 17858)):
#0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
#2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000044b3a3 in __wt_session_get_btree ()
#4  0x000000000044b6d2 in __wt_session_get_btree_ckpt ()
#5  0x000000000048b13f in __wt_curfile_open ()
#6  0x00000000004498d0 in __wt_open_cursor ()
#7  0x0000000000449b25 in __session_open_cursor ()
#8  0x00000000004b04e4 in __wt_bloom_finalize ()
#9  0x000000000049ffbf in __wt_lsm_work_bloom ()
#10 0x00000000004255f5 in __lsm_worker ()
#11 0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f4905e93b2d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f4900bfd700 (LWP 17860)):
#0  0x00007f4906164265 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f490615fdc1 in _L_lock_816 () from /lib64/libpthread.so.0
#2  0x00007f490615fcc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000448706 in __session_create ()
#4  0x00000000004b04bb in __wt_bloom_finalize ()
#5  0x000000000049e5d2 in __wt_lsm_merge ()
#6  0x00000000004254d7 in __lsm_worker ()
#7  0x00007f490615df18 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f4905e93b2d in clone () from /lib64/libc.so.6

I will try to repro on the AWS HDD machine.



 Comments   
Comment by Sue Loverso [ 24/Mar/15 ]

Alex Gorrod and [~michaelcahill], this is probably related to the sweep server locking changes from yesterday, such as 38a208966af2f2ba1a34. It happened during the compact portion of the test.

Comment by Sue Loverso [ 24/Mar/15 ]

Actually, I'm not 100% sure what state the branch is in. I think that changeset was reverted and the tree is back to its original state. This looks like it may be another instance of the hang [~michaelcahill] saw in automated testing, but there is no issue number referenced.

Comment by Michael Cahill [ 24/Mar/15 ]

The current pull request for this is WT-1811.

That said, we know that change causes other problems.

This is a pretty simple lock-ordering issue: sweep holds an exclusive handle lock and waits on the handle list lock, while the normal order for other threads is to take the handle list lock first, then the handle lock.
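
To make the inversion concrete, here is a minimal standalone C sketch. The lock and thread names (handle_list_lock, handle_lock, sweep_server, worker) are illustrative assumptions, not WiredTiger's actual locks or functions:

    /* Hypothetical sketch of the lock-order inversion, not WiredTiger code. */
    #include <pthread.h>

    static pthread_mutex_t handle_list_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t handle_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Sweep: holds the exclusive handle lock, then wants the list lock. */
    static void *sweep_server(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&handle_lock);
        pthread_mutex_lock(&handle_list_lock);  /* blocks if a worker holds it */
        /* ... discard the handle, unlink it from the list ... */
        pthread_mutex_unlock(&handle_list_lock);
        pthread_mutex_unlock(&handle_lock);
        return NULL;
    }

    /* Every other thread: list lock first, then the handle lock. */
    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&handle_list_lock);
        pthread_mutex_lock(&handle_lock);       /* blocks if sweep holds it */
        /* ... operate on the handle ... */
        pthread_mutex_unlock(&handle_lock);
        pthread_mutex_unlock(&handle_list_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, sweep_server, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);  /* with unlucky timing, never returns */
        pthread_join(t2, NULL);
        return 0;
    }

Compiled with -lpthread and run with unlucky scheduling, the two threads block each other in the same mutual wait as the stacks above.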

The thought I had overnight about this was to split sweep into two passes: the first would go through the handles without the handle list lock (which is safe because sweep is the only thread that removes handles from the list). In the first pass, we would just be looking for trees to close and checking if there are any closed handles in the list.

If the first pass sees any closed handles, do a second pass that holds the list lock and removes the closed handles from the list.

This way, sweep should never try to acquire the handle list lock while it is holding a handle lock. I'll code that up today and open a new pull request.
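
For concreteness, a sketch of the two-pass shape. The types and helpers here are made up for illustration, not the eventual patch (the actual work is tracked in WT-1819, linked above):

    /* Illustrative types and helpers only; not WiredTiger's dhandle code. */
    #include <pthread.h>
    #include <stdlib.h>

    struct handle {
        struct handle *next;
        int open;   /* underlying tree still open? */
        int idle;   /* unused long enough to close? */
    };

    static struct handle *handle_list;  /* sweep is the only thread that removes entries */
    static pthread_mutex_t handle_list_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for closing the tree under its per-handle lock. */
    static void close_handle(struct handle *h)
    {
        h->open = 0;
    }

    static void sweep(void)
    {
        struct handle *h, **hp;
        int closed_seen = 0;

        /*
         * Pass 1: walk without the list lock. Safe only because sweep
         * is the sole remover, so the links it follows cannot go away
         * underneath it.
         */
        for (h = handle_list; h != NULL; h = h->next) {
            if (h->open && h->idle)
                close_handle(h);
            if (!h->open)
                closed_seen = 1;
        }

        /*
         * Pass 2: only if pass 1 saw closed handles, take the list lock
         * and unlink them. No handle lock is held here.
         */
        if (closed_seen) {
            pthread_mutex_lock(&handle_list_lock);
            for (hp = &handle_list; (h = *hp) != NULL;)
                if (!h->open) {
                    *hp = h->next;
                    free(h);
                } else
                    hp = &h->next;
            pthread_mutex_unlock(&handle_list_lock);
        }
    }

The invariant is that pass 1 may take per-handle locks but never the list lock, while pass 2 takes the list lock but no handle lock, so the inverted ordering above cannot recur.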

Comment by Ramon Fernandez [ 16/Apr/15 ]

Additional ticket information from GitHub

This ticket was referenced in the following commits: