This ticket describes the a new cache stuck failure happening in test/format. This follows on from the recent work performed in WT-7010. It is moved here, because the original error in WT-7010 has already been fixed. The recent error that caused WT-7010 to re-open again describes a different cache stuck issue.
The cache stuck symptoms:
[2021/08/07 13:43:24.436] [1628343522:178826][29521:0x703706aef160], eviction-server: __evict_server, 452: Cache stuck for too long, giving up: Connection timed out [2021/08/07 13:43:24.436] 0x1002f9679e0:=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= [2021/08/07 13:43:24.436] 0x1002f9679e0:transaction state dump [2021/08/07 13:43:24.436] 0x1002f9679e0:current ID: 20942 [2021/08/07 13:43:24.436] 0x1002f9679e0:last running ID: 20942 [2021/08/07 13:43:24.436] 0x1002f9679e0:metadata_pinned ID: 20855 [2021/08/07 13:43:24.436] 0x1002f9679e0:oldest ID: 20942 [2021/08/07 13:43:24.436] 0x1002f9679e0:durable timestamp: (0, 0) [2021/08/07 13:43:24.436] 0x1002f9679e0:oldest timestamp: (0, 0) [2021/08/07 13:43:24.436] 0x1002f9679e0:pinned timestamp: (0, 0) [2021/08/07 13:43:24.436] 0x1002f9679e0:stable timestamp: (0, 0) [2021/08/07 13:43:24.436] 0x1002f9679e0:has_durable_timestamp: no [2021/08/07 13:43:24.436] 0x1002f9679e0:has_oldest_timestamp: no [2021/08/07 13:43:24.436] 0x1002f9679e0:has_pinned_timestamp: no [2021/08/07 13:43:24.436] 0x1002f9679e0:has_stable_timestamp: no [2021/08/07 13:43:24.436] 0x1002f9679e0:oldest_is_pinned: no [2021/08/07 13:43:24.436] 0x1002f9679e0:stable_is_pinned: no [2021/08/07 13:43:24.436] 0x1002f9679e0:checkpoint running: yes [2021/08/07 13:43:24.436] 0x1002f9679e0:checkpoint generation: 4 [2021/08/07 13:43:24.436] 0x1002f9679e0:checkpoint pinned ID: 20764 [2021/08/07 13:43:24.436] 0x1002f9679e0:checkpoint txn ID: 20855 [2021/08/07 13:43:24.436] 0x1002f9679e0:session count: 43 [2021/08/07 13:43:24.436] 0x1002f9679e0:Transaction state of active sessions: [2021/08/07 13:43:24.436] 0x1002f9679e0:=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= [2021/08/07 13:43:24.436] 0x1002f9679e0:cache dump [2021/08/07 13:43:24.436] 0x1002f9679e0:cache full: no [2021/08/07 13:43:24.436] 0x1002f9679e0:cache clean check: no (79.574%) [2021/08/07 13:43:24.436] 0x1002f9679e0:cache dirty check: no (11.160%) [2021/08/07 13:43:24.436] 0x1002f9679e0:cache updates check: yes (10.726%) [2021/08/07 13:43:24.436] 0x1002f9679e0:file:wt.wt(<live>): [2021/08/07 13:43:24.436] 0x1002f9679e0:internal: 11629 pages, 83MB, 10032/1597 clean/dirty pages, 72/10 clean/dirty MB, 0MB max page, 0MB max dirty page [2021/08/07 13:43:24.436] 0x1002f9679e0:leaf: 107005 pages, 87MB, 88923/18082 clean/dirty pages, 63/23/23 clean/dirty/updates MB, 0MB max page, 0MB max dirty page [2021/08/07 13:43:24.436] 0x1002f9679e0:file:WiredTigerHS.wt(<live>): [2021/08/07 13:43:24.436] 0x1002f9679e0:internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [2021/08/07 13:43:24.436] 0x1002f9679e0:leaf: 1 pages, 0MB, 1/0 clean/dirty pages, 0/0/0 clean/dirty/updates MB, 0MB max page, 0MB max dirty page [2021/08/07 13:43:24.436] 0x1002f9679e0:file:WiredTiger.wt(<live>): [2021/08/07 13:43:24.436] 0x1002f9679e0:internal: 1 pages, 0MB, 1/0 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page [2021/08/07 13:43:24.436] 0x1002f9679e0:leaf: 0 pages [2021/08/07 13:43:24.436] 0x1002f9679e0:cache dump: total found: 184MB vs tracked inuse 170MB [2021/08/07 13:43:24.436] 0x1002f9679e0:total dirty bytes: 34MB vs tracked dirty 34MB [2021/08/07 13:43:24.436] 0x1002f9679e0:total updates bytes: 23MB vs tracked updates 24MB [2021/08/07 13:43:24.436] [1628343522:242187][29521:0x703706aef160], eviction-server: __wt_evict_thread_run, 324: cache eviction thread error: Connection timed out [2021/08/07 13:43:24.436] [1628343522:242212][29521:0x703706aef160], eviction-server: __wt_evict_thread_run, 324: the process must exit and restart: WT_PANIC: WiredTiger library panic [2021/08/07 13:43:24.436] [1628343522:242222][29521:0x703706aef160], eviction-server: __wt_abort, 28: aborting WiredTiger library
The symptoms of the cache failure here is shown through the backtrace:
#0 __libc_pwrite64 (offset=<optimized out>, count=1024, buf=0x7fac848c9000, fd=8) at ../sysdeps/unix/sysv/linux/pwrite64.c:29 #1 __libc_pwrite64 (fd=8, buf=buf@entry=0x7fac848c9000, count=count@entry=1024, offset=offset@entry=154221056) at ../sysdeps/unix/sysv/linux/pwrite64.c:27 #2 0x00000000004ccf43 in __posix_file_write (file_handle=0x210e7c0, wt_session=0x209b5c0, offset=154221056, len=1024, buf=<optimized out>) at ../src/os_posix/os_fs.c:610 #3 0x0000000000549235 in __wt_write (buf=0x7fac848c9000, len=1024, offset=<optimized out>, fh=<optimized out>, session=0x209b5c0) at ../src/include/os_fhandle_inline.h:182 #4 __block_write_off (caller_locked=<optimized out>, checkpoint_io=<optimized out>, data_checksum=<optimized out>, checksump=0x7fac767fb404, sizep=0x7fac767fb40c, offsetp=0x7fac767fb410, objectidp=<optimized out>, buf=<optimized out>, block=0xa0254a0, session=0x209b5c0) at ../src/block/block_write.c:327 #5 __wt_block_write_off (session=0x209b5c0, block=0xa0254a0, buf=<optimized out>, objectidp=0x7fac767fb408, offsetp=0x7fac767fb410, sizep=0x7fac767fb40c, checksump=0x7fac767fb404, data_checksum=false, checkpoint_io=true, caller_locked=false) at ../src/block/block_write.c:389 #6 0x0000000000549b39 in __wt_block_write (session=<optimized out>, block=0xa0254a0, buf=<optimized out>, addr=0x7fac767fb590 ")", addr_sizep=0x7fac767fb580, data_checksum=<optimized out>, checkpoint_io=true) at ../src/block/block_write.c:195 #7 0x000000000045a249 in __wt_bt_write (session=session@entry=0x209b5c0, buf=buf@entry=0x7fac8433e9e0, addr=addr@entry=0x7fac767fb590 ")", addr_sizep=addr_sizep@entry=0x7fac767fb580, compressed_sizep=compressed_sizep@entry=0x7fac767fb588, checkpoint=checkpoint@entry=false, checkpoint_io=<optimized out>, compressed=<optimized out>) at ../src/btree/bt_io.c:351 #8 0x00000000004e1417 in __rec_split_write (session=0x209b5c0, r=0x7fac8433e880, chunk=<optimized out>, last_block=<optimized out>, compressed_image=0x0) at ../src/reconcile/rec_write.c:1863 #9 0x0000000000636019 in __wt_rec_row_leaf (session=session@entry=0x209b5c0, r=r@entry=0x7fac8433e880, pageref=pageref@entry=0x7fac6006b5e0, salvage=salvage@entry=0x0) at ../src/reconcile/rec_row.c:1069 #10 0x00000000004e4bb2 in __reconcile (page_lockedp=<synthetic pointer>, flags=68, salvage=0x0, ref=0x7fac6006b5e0, session=0x209b5c0) at ../src/reconcile/rec_write.c:194 #11 __wt_reconcile (session=session@entry=0x209b5c0, ref=ref@entry=0x7fac6006b5e0, salvage=salvage@entry=0x0, flags=flags@entry=68) at ../src/reconcile/rec_write.c:99 #12 0x00000000005b4cc0 in __wt_sync_file (session=session@entry=0x209b5c0, syncop=syncop@entry=WT_SYNC_CHECKPOINT) at ../src/btree/bt_sync.c:657 #13 0x0000000000526ca1 in __checkpoint_tree (session=0x209b5c0, is_checkpoint=<optimized out>, cfg=<optimized out>) at ../src/txn/txn_ckpt.c:1878 #14 0x0000000000527bf5 in __checkpoint_tree_helper (cfg=0x7fac767fbe20, session=0x209b5c0) at ../src/txn/txn_ckpt.c:1994 #15 __checkpoint_apply_to_dhandles (op=<optimized out>, cfg=<optimized out>, session=<optimized out>) at ../src/txn/txn_ckpt.c:198 #16 __txn_checkpoint (session=0x209b5c0, cfg=0x7fac767fbe20) at ../src/txn/txn_ckpt.c:920 #17 0x000000000052969c in __txn_checkpoint_wrapper (cfg=0x7fac767fbe20, session=0x209b5c0) at ../src/txn/txn_ckpt.c:1163 #18 __wt_txn_checkpoint (session=session@entry=0x209b5c0, cfg=cfg@entry=0x7fac767fbe20, waiting=waiting@entry=true) at ../src/txn/txn_ckpt.c:1217 #19 0x000000000041b333 in __session_checkpoint (wt_session=0x209b5c0, config=0x64ccea "drop=(all)") at ../src/session/session_api.c:1878 #20 0x0000000000439ba1 in checkpoint (arg=<optimized out>) at ../../../test/format/checkpoint.c:121 #21 0x00007faca6dd2609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #22 0x00007faca6cd0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Aborting Thread:
Thread 1 (Thread 0x7faca6052700 (LWP 74586)): #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007faca6bd3859 in __GI_abort () at abort.c:79 #2 0x0000000000413c60 in __wt_abort (session=session@entry=0x2095f20) at ../src/os_common/os_abort.c:30 #3 0x000000000041c6d6 in __wt_panic_func (session=session@entry=0x2095f20, error=error@entry=110, func=func@entry=0x660cb0 <__PRETTY_FUNCTION__.46687> "__wt_evict_thread_run", line=line@entry=324, fmt=fmt@entry=0x66059f "cache eviction thread error") at ../src/support/err.c:387 #4 0x000000000040ed04 in __wt_evict_thread_run (session=0x2095f20, thread=0x20c9550) at ../src/evict/evict_lru.c:324 #5 0x0000000000518eb1 in __thread_run (arg=0x20c9550) at ../src/support/thread_group.c:31 #6 0x00007faca6dd2609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #7 0x00007faca6cd0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95