Details
Description
In a local run of test/format, I encountered a failure with format run more than 15 minutes past the maximum time. Interestingly the cache isn't full:
0x7f1dc320b010:cache clean check: no (7.466%)
|
0x7f1dc320b010:cache dirty check: no (2.422%)
|
The call stacks show that all user threads are here:
#2 __wt_spin_backoff (yield_count=0x7f1daa7e16b0, sleep_usecs=0x7f1daa7e16a8)
|
at ../src/include/misc.i:245
|
#3 __wt_page_in_func (session=0x7f1dc3215c58, ref=0x7f1d5081d2f0, flags=3746,
|
func=0x55aeac0a8310 <__func__.35452> "__tree_walk_internal", line=527) at ../src/btree/bt_read.c:864
|
#4 __wt_page_swap_func (session=0x7f1dc3215c58, held=0x0, want=0x7f1d5081d2f0, flags=3746,
|
func=0x55aeac0a8310 <__func__.35452> "__tree_walk_internal", line=527) at ../src/include/btree.i:1702
|
#5 __tree_walk_internal (session=0x7f1dc3215c58, refp=0x7f1d94146238, walkcntp=0x0,
|
skip_func=0x0, func_cookie=0x0, flags=2690) at ../src/btree/bt_walk.c:526
|
#6 __wt_tree_walk (session=0x7f1dc3215c58, refp=0x7f1d94146238, flags=2688)
|
at ../src/btree/bt_walk.c:597
|
#7 __wt_btcur_prev (cbt=0x7f1d941460c0, truncating=false) at ../src/btree/bt_curprev.c:675
|
#8 __curfile_prev (cursor=0x7f1d941460c0) at ../src/cursor/cur_file.c:147
|
#9 read_op (cursor=0x7f1d941460c0, op=PREV, exactp=0x0) at ../../../test/format/ops.c:325
|
#10 nextprev (tinfo=0x55aeaed34f90, cursor=0x7f1d941460c0, next=false)
|
at ../../../test/format/ops.c:1364
|
#11 ops (arg=0x55aeaed34f90) at ../../../test/format/ops.c:1107
|
That is spinning in page in, waiting for a page that is in WT_REF_LOCKED (or similar). The checkpoint thread is attempting to do a reverse split:
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
|
#1 0x00007f1dc2660023 in __GI___pthread_mutex_lock (mutex=0x7f1d482b76b8) at ../nptl/pthread_mutex_lock.c:78
|
#2 0x000055aeabfb5ca0 in __wt_spin_lock (session=0x7f1dc3212920, t=0x7f1d482b76b8) at ../src/include/mutex.i:173
|
#3 0x000055aeabfbc85e in __split_internal_lock (session=0x7f1dc3212920, ref=0x7f1d5081d2f0, trylock=false,
|
parentp=0x7f1dac7e7620) at ../src/btree/bt_split.c:1251
|
#4 0x000055aeabfbf21b in __split_reverse (session=0x7f1dc3212920, ref=0x7f1d5081d2f0) at ../src/btree/bt_split.c:2229
|
#5 0x000055aeabfbf313 in __wt_split_reverse (session=0x7f1dc3212920, ref=0x7f1d5081d2f0)
|
at ../src/btree/bt_split.c:2251
|
#6 0x000055aeabeeb7bb in __evict_delete_ref (session=0x7f1dc3212920, ref=0x7f1d5081d2f0, closing=false)
|
at ../src/evict/evict_page.c:284
|
#7 0x000055aeabeebbec in __evict_page_clean_update (session=0x7f1dc3212920, ref=0x7f1d5081d2f0, closing=false)
|
at ../src/evict/evict_page.c:336
|
#8 0x000055aeabeeb34b in __wt_evict (session=0x7f1dc3212920, ref=0x7f1d5081d2f0, closing=false, previous_state=5)
|
at ../src/evict/evict_page.c:219
|
#9 0x000055aeabeeacb2 in __wt_page_release_evict (session=0x7f1dc3212920, ref=0x7f1d5081d2f0)
|
at ../src/evict/evict_page.c:92
|
#10 0x000055aeabf15d93 in __wt_page_release (session=0x7f1dc3212920, ref=0x7f1d5081d2f0, flags=136)
|
at ../src/include/btree.i:1521
|
#11 0x000055aeabf19697 in __rec_child_modify (session=0x7f1dc3212920, r=0x7f1d6408d430, ref=0x7f1d5081d2f0,
|
hazardp=0x7f1dac7e7891, statep=0x7f1dac7e7894) at ../src/reconcile/rec_write.c:1922
|
#12 0x000055aeabf1fe17 in __rec_row_int (session=0x7f1dc3212920, r=0x7f1d6408d430, page=0x55aeb07b0400)
|
at ../src/reconcile/rec_write.c:4854
|
#13 0x000055aeabf1674c in __wt_reconcile (session=0x7f1dc3212920, ref=0x55aeb0662db0, salvage=0x0, flags=1,
|
lookaside_retryp=0x0) at ../src/reconcile/rec_write.c:469
|
#14 0x000055aeabfc387d in __wt_sync_file (session=0x7f1dc3212920, syncop=WT_SYNC_CHECKPOINT)
|
at ../src/btree/bt_sync.c:337
|
#15 0x000055aeabf69736 in __checkpoint_tree (session=0x7f1dc3212920, is_checkpoint=true, cfg=0x7f1dac7e7e20)
|
at ../src/txn/txn_ckpt.c:1667
|
#16 0x000055aeabf69a78 in __checkpoint_tree_helper (session=0x7f1dc3212920, cfg=0x7f1dac7e7e20)
|
at ../src/txn/txn_ckpt.c:1783
|
#17 0x000055aeabf65d44 in __checkpoint_apply (session=0x7f1dc3212920, cfg=0x7f1dac7e7e20,
|
op=0x55aeabf699ea <__checkpoint_tree_helper>) at ../src/txn/txn_ckpt.c:218
|
#18 0x000055aeabf67941 in __txn_checkpoint (session=0x7f1dc3212920, cfg=0x7f1dac7e7e20) at ../src/txn/txn_ckpt.c:909
|
#19 0x000055aeabf683ca in __txn_checkpoint_wrapper (session=0x7f1dc3212920, cfg=0x7f1dac7e7e20)
|
at ../src/txn/txn_ckpt.c:1107
|
#20 0x000055aeabf68546 in __wt_txn_checkpoint (session=0x7f1dc3212920, cfg=0x7f1dac7e7e20, waiting=true)
|
at ../src/txn/txn_ckpt.c:1171
|
#21 0x000055aeabf43f77 in __session_checkpoint (wt_session=0x7f1dc3212920, config=0x0)
|
at ../src/session/session_api.c:1984
|
#22 0x000055aeabeb1821 in checkpoint (arg=0x0) at ../../../test/format/util.c:571
|
#23 0x00007f1dc265d6db in start_thread (arg=0x7f1dac7e8700) at pthread_create.c:463
|
Interestingly there is a comment in __split_lock_internal indicating that we've dealt with the reverse of this in the past:
/*
|
* A checkpoint reconciling this parent page can deadlock with
|
* our split. We have an exclusive page lock on the child before
|
* we acquire the page's reconciliation lock, and reconciliation
|
* acquires the page's reconciliation lock before it encounters
|
* the child's exclusive lock (which causes reconciliation to
|
* loop until the exclusive lock is resolved). If we want to split
|
* the parent, give up to avoid that deadlock.
|
*/
|
The open question is who is holding that lock - locks are only taken when modifying pages - and all the other threads are doing reads, so shouldn't be holding the lock. Perhaps the checkpoint thread already has the lock?
Attachments
Issue Links
- related to
-
WT-4748 internal eviction flag handling could be simplified
-
- Closed
-