-
Type: Task
-
Resolution: Done
-
None
-
Affects Version/s: None
-
Component/s: None
There is a test/format LSM job that got stuck. The configuration file is:
############################################
#  RUN PARAMETERS
############################################
abort=0
auto_throttle=1
firstfit=0
bitcnt=2
bloom=1
bloom_bit_count=4
bloom_hash_count=24
bloom_oldest=0
cache=30
checkpoints=1
checksum=uncompressed
chunk_size=1
compaction=0
compression=zlib
data_extend=0
data_source=lsm
delete_pct=18
dictionary=0
evict_max=5
file_type=row-store
backups=0
huffman_key=0
huffman_value=0
insert_pct=83
internal_key_truncation=1
internal_page_max=14
isolation=read-uncommitted
key_gap=1
key_max=122
key_min=10
leak_memory=0
leaf_page_max=11
logging=0
logging_archive=1
logging_prealloc=1
logging=0
lsm_worker_threads=4
merge_max=13
mmap=1
ops=100000
prefix_compression=1
prefix_compression_min=6
repeat_data_pct=54
reverse=0
rows=100000
runs=100
split_pct=67
statistics=0
statistics_server=0
threads=21
timer=0
value_max=3202
value_min=15
wiredtiger_config=
write_pct=66
############################################
The LSM tree has 20 active chunks. Of those chunks 5 are flushed, the rest are all in memory. The non-flushed chunks are filling the cache.
There are 4 LSM worker threads, one of which is the manager. Of the other three, one can only do switch and drop operations (that thread is idle), and one is currently doing a merge but is stuck because the cache is full:
Thread 28 (Thread 0x7f6dfc378700 (LWP 110983)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00000000004402f6 in __wt_cond_wait (session=0x2525fe0, cond=0x25208c0, usecs=100000) at ../src/os_posix/os_mtx_cond.c:78
#2 0x000000000042c031 in __wt_cache_wait (session=0x2525fe0, full=131) at ../src/evict/evict_lru.c:1464
#3 0x00000000004dd1cf in __wt_cache_full_check (session=0x2525fe0) at ../src/include/cache.i:197
#4 0x00000000004de510 in __cursor_enter (session=0x2525fe0) at ../src/include/cursor.i:63
#5 0x00000000004de5d8 in __curfile_enter (cbt=0x7f6de4061b20) at ../src/include/cursor.i:96
#6 0x00000000004de791 in __cursor_func_init (cbt=0x7f6de4061b20, reenter=0) at ../src/include/cursor.i:198
#7 0x00000000004e00a9 in __wt_btcur_next (cbt=0x7f6de4061b20, truncating=0) at ../src/btree/bt_curnext.c:415
#8 0x00000000004b144b in __curfile_next (cursor=0x7f6de4061b20) at ../src/cursor/cur_file.c:113
#9 0x00000000004c7373 in __clsm_next (cursor=0x7f6de4183e10) at ../src/lsm/lsm_cursor.c:795
#10 0x00000000004cb4cc in __wt_lsm_merge (session=0x2525fe0, lsm_tree=0x25034e0, id=2) at ../src/lsm/lsm_merge.c:346
#11 0x000000000043b497 in __lsm_worker (arg=0x2517920) at ../src/lsm/lsm_worker.c:138
The remaining thread is creating a bloom filter, and is stuck waiting for the cache to get less full:
Thread 27 (Thread 0x7f6dfbb77700 (LWP 110984)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00000000004402f6 in __wt_cond_wait (session=0x25262e0, cond=0x25208c0, usecs=100000) at ../src/os_posix/os_mtx_cond.c:78
#2 0x000000000042c031 in __wt_cache_wait (session=0x25262e0, full=131) at ../src/evict/evict_lru.c:1464
#3 0x00000000004dd1cf in __wt_cache_full_check (session=0x25262e0) at ../src/include/cache.i:197
#4 0x00000000004de510 in __cursor_enter (session=0x25262e0) at ../src/include/cursor.i:63
#5 0x00000000004de5d8 in __curfile_enter (cbt=0x7f6da41c7800) at ../src/include/cursor.i:96
#6 0x00000000004de791 in __cursor_func_init (cbt=0x7f6da41c7800, reenter=0) at ../src/include/cursor.i:198
#7 0x00000000004e00a9 in __wt_btcur_next (cbt=0x7f6da41c7800, truncating=0) at ../src/btree/bt_curnext.c:415
#8 0x00000000004b144b in __curfile_next (cursor=0x7f6da41c7800) at ../src/cursor/cur_file.c:113
#9 0x00000000004c7373 in __clsm_next (cursor=0x7f6da4c21a20) at ../src/lsm/lsm_cursor.c:795
#10 0x00000000004ce9ad in __lsm_bloom_create (session=0x25262e0, lsm_tree=0x25034e0, chunk=0x7f6d48003d30, chunk_off=7) at ../src/lsm/lsm_work_unit.c:405
#11 0x00000000004ce124 in __wt_lsm_work_bloom (session=0x25262e0, lsm_tree=0x25034e0) at ../src/lsm/lsm_work_unit.c:224
#12 0x000000000043b2cf in __lsm_worker_general_op (session=0x25262e0, cookie=0x2517948, completed=0x7f6dfbb76ee0) at ../src/lsm/lsm_worker.c:74
#13 0x000000000043b3bf in __lsm_worker (arg=0x2517948) at ../src/lsm/lsm_worker.c:122
I think creating bloom filters doesn't expect to get stuck waiting for space in the cache. We set the WT_SESSION_NO_CACHE_CHECK flag when doing a post-create traversal of the bloom filter. We don't set that flag when traversing the chunk to create the bloom filter itself, even though we set WT_SESSION_NO_CACHE.
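The effect of setting the flag around the chunk traversal can be modelled with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in and not WiredTiger code: the `model_session` struct, the `MODEL_*` flag values, and the helper names are assumptions made for illustration, and a thread that would block is modelled by stopping the walk rather than sleeping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the session flags named in the ticket. */
#define MODEL_NO_CACHE       0x1u
#define MODEL_NO_CACHE_CHECK 0x2u

/* Hypothetical session model: flags plus a simulated cache fill level. */
struct model_session {
    uint32_t flags;
    unsigned cache_pct_full; /* e.g. 131, as in "full=131" in the traces */
    unsigned cache_waits;    /* times a cursor step would have blocked */
};

/* Models the cache-full check made on cursor entry. */
static bool
model_would_block(const struct model_session *s)
{
    if (s->flags & MODEL_NO_CACHE_CHECK)
        return false;
    return s->cache_pct_full >= 100;
}

/*
 * Walk nrecords records. A real thread would sleep when the check
 * fails; the model records the wait and stops instead.
 */
static unsigned
model_traverse_chunk(struct model_session *s, unsigned nrecords)
{
    unsigned i, visited = 0;

    for (i = 0; i < nrecords; i++) {
        if (model_would_block(s)) {
            s->cache_waits++;
            break;
        }
        visited++;
    }
    return visited;
}

/* Set both flags for the duration of the traversal, then restore. */
static unsigned
model_bloom_create_traversal(struct model_session *s, unsigned nrecords)
{
    uint32_t saved = s->flags;
    unsigned visited;

    s->flags |= MODEL_NO_CACHE | MODEL_NO_CACHE_CHECK;
    visited = model_traverse_chunk(s, nrecords);
    s->flags = saved;
    return visited;
}
```

With the cache modelled at 131% full, a plain `model_traverse_chunk` call stops on the first record, while `model_bloom_create_traversal` visits every record and leaves the session flags as it found them.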
Alternatively we could fiddle with the LSM worker thread work unit assignments, so that the thread that only does switches and drops (very short lived operations) could do flushes as well if we've stopped making progress. The difficulty would be in determining when we are and aren't making progress.
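One possible way to decide when progress has stopped is to watch a counter of completed work units and treat several unchanged observations in a row as a stall. The sketch below only illustrates that heuristic. The `model_worker` struct, the `MODEL_WORK_*` masks, the threshold of 3, and the completed-units counter are all assumptions, not existing WiredTiger structures or settings.

```c
#include <stdint.h>

/* Hypothetical operation masks for a worker thread. */
#define MODEL_WORK_SWITCH 0x1u
#define MODEL_WORK_DROP   0x2u
#define MODEL_WORK_FLUSH  0x4u

/* Assumed number of unchanged observations that counts as a stall. */
#define MODEL_STALL_THRESHOLD 3u

struct model_worker {
    uint32_t base_ops;       /* operations the thread normally accepts */
    uint32_t last_completed; /* completed-unit count at the last check */
    unsigned stalled_checks; /* consecutive checks with no change */
};

/*
 * Return the operations the worker may take on this pass. If the
 * tree-wide completed-unit count has not moved for
 * MODEL_STALL_THRESHOLD consecutive checks, also allow flushes.
 */
static uint32_t
model_worker_allowed_ops(struct model_worker *w, uint32_t completed_units)
{
    if (completed_units != w->last_completed) {
        /* Progress was made: remember it and reset the stall count. */
        w->last_completed = completed_units;
        w->stalled_checks = 0;
    } else if (w->stalled_checks < MODEL_STALL_THRESHOLD)
        w->stalled_checks++;

    if (w->stalled_checks >= MODEL_STALL_THRESHOLD)
        return w->base_ops | MODEL_WORK_FLUSH;
    return w->base_ops;
}
```

A counter-based check like this is cheap, but it cannot tell a stalled tree from an idle one, so a real implementation would probably also need to confirm that work is actually queued before widening the thread's role.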
- is related to
-
WT-1722 Don't allow LSM bloom create to block waiting for space in the cache.
- Closed