Details
-
Task
-
Status: Closed
-
Resolution: Done
-
None
-
None
-
None
-
None
Description
When I run against LinkBench, I occasionally see a hang:
34 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__wt_log_slot_wait,__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
|
15 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__log_release,__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
|
1 slot=0x27dd890),__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
|
There are 50 application threads. 34 of those threads are waiting for their slots to be finalized. 15 of those threads are waiting for their chance to release their slot. 1 of those threads is attempting to find a new slot in the slot pool.
If I capture the hang in GDB I can see:
Thread 67 (Thread 0x7ff98827d700 (LWP 12971)):
|
#0 0x00007ff9aa10c7cc in __wt_log_slot_close (
|
session=session@entry=0x1bf4118, slot=0x1c06670)
|
at ../src/log/log_slot.c:197
|
WT-1 0x00007ff9aa10b440 in __wt_log_write (session=session@entry=0x1bf4118,
|
record=0x7ff57c011940, lsnp=lsnp@entry=0x0, flags=2)
|
at ../src/log/log.c:1018
|
WT-2 0x00007ff9aa13ad0d in __wt_txn_log_commit (
|
session=session@entry=0x1bf4118, cfg=cfg@entry=0x7ff98827abe0)
|
at ../src/txn/txn_log.c:129
|
WT-3 0x00007ff9aa1391a6 in __wt_txn_commit (session=session@entry=0x1bf4118,
|
cfg=cfg@entry=0x7ff98827abe0) at ../src/txn/txn.c:352
|
WT-4 0x00007ff9aa1317d5 in __session_commit_transaction (wt_session=0x1bf4118,
|
config=<optimised out>) at ../src/session/session_api.c:649
|
(More stack frames follow...)
|
|
(gdb) thread 67
|
[Switching to thread 67 (Thread 0x7ff98827d700 (LWP 12971))]
|
#0 0x00007ff9aa10c7cc in __wt_log_slot_close (
|
session=session@entry=0x1bf4118, slot=0x1c06670)
|
at ../src/log/log_slot.c:197
|
197 if (++log->pool_index >= SLOT_POOL)
|
(gdb) p slot->slot_release_lsn
|
$1 = {file = 155, offset = 42647808}
|
(gdb) p log->write_lsn
|
$2 = {file = 155, offset = 42648192}
|
(gdb) p *slot
|
$3 = {slot_state = 25347, slot_group_size = 384, slot_error = 0,
|
slot_index = 0, slot_start_offset = 42647808, slot_release_lsn = {
|
file = 155, offset = 42647808}, slot_start_lsn = {file = 155,
|
offset = 42647808}, slot_end_lsn = {file = 155, offset = 42648192},
|
slot_fh = 0x7ff4d1e3b680, slot_buf = {data = 0x7ff89bddf010, size = 0,
|
flags = 0, mem = 0x7ff89bddf010, memsize = 40706048},
|
slot_done_cond = 0x1c06d20, flags = 2}
|
(gdb) p slot->slot_end_lsn
|
$4 = {file = 155, offset = 42648192}
|
So.. The thread in
has a
|
that is one behind the current
|
. That should not happen.
|
|
I think there is a real issue in this code, if there are two or more active slots, then a slot may own the oldest slot_release_lsn, and get stuck waiting for an available slot in
|