I've been doing some experimentation with failure injection.
During one of my experiments I found that WiredTiger had hung following the random injection of a calloc failure.
Looking at the stack, I found the following:
Thread 3 (Thread 0x7f62df85a700 (LWP 4986)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x0000000000445fd2 in __wt_cond_wait_signal (session=0x7f62e18649d0, cond=0x15ec080, usecs=100000, signalled=0x7f62df859e9f) at ../src/os_posix/os_mtx_cond.c:82
#2 0x0000000000428d79 in __wt_cond_wait (session=0x7f62e18649d0, cond=0x15ec080, usecs=100000) at ../src/include/misc.i:18
#3 0x000000000042a24d in __evict_server (arg=0x7f62e18649d0) at ../src/evict/evict_lru.c:241
#4 0x00007f62e0bfc555 in start_thread (arg=0x7f62df85a700) at pthread_create.c:333
#5 0x00007f62e00f9b9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7f62df059700 (LWP 4987)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x0000000000445fd2 in __wt_cond_wait_signal (session=0x7f62e1864d10, cond=0x161d590, usecs=10000000, signalled=0x7f62df058ebf) at ../src/os_posix/os_mtx_cond.c:82
#2 0x000000000041ec48 in __wt_cond_wait (session=0x7f62e1864d10, cond=0x161d590, usecs=10000000) at ../src/include/misc.i:18
#3 0x000000000041f8e1 in __sweep_server (arg=0x7f62e1864d10) at ../src/conn/conn_sweep.c:272
#4 0x00007f62e0bfc555 in start_thread (arg=0x7f62df059700) at pthread_create.c:333
#5 0x00007f62e00f9b9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7f62e1938740 (LWP 4985)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x0000000000445fd2 in __wt_cond_wait_signal (session=0x0, cond=0x161d600, usecs=100000, signalled=0x7ffd9a6979ff) at ../src/os_posix/os_mtx_cond.c:82
#2 0x000000000048e28a in __wt_cond_wait (session=0x0, cond=0x161d600, usecs=100000) at ../src/include/misc.i:18
#3 0x000000000048fa8e in __wt_async_flush (session=0x7f62e1864010) at ../src/async/async_api.c:533
#4 0x000000000041c6c4 in __wt_connection_close (conn=0x15da370) at ../src/conn/conn_open.c:104
#5 0x00000000004155fd in wiredtiger_open (home=0x52a67d "WT_TEST", event_handler=0x0, config=0x15d9b30 "create,cache_size=21G,checkpoint_sync=false,mmap=false,session_max=1024,lsm_manager=(worker_thread_max=6),create,cache_size=21G,checkpoint_sync=false,mmap=false,session_max=1024,lsm_manager=(worker_th"..., wt_connp=0x7ffd9a697ed0) at ../src/conn/conn_api.c:2092
#6 0x000000000040ac7c in start_run (cfg=0x7ffd9a697ea0) at ../../../bench/wtperf/wtperf.c:1947
#7 0x000000000040a851 in start_all_runs (cfg=0x7ffd9a697ea0) at ../../../bench/wtperf/wtperf.c:1858
#8 0x000000000040bf90 in main (argc=5, argv=0x7ffd9a6981b8) at ../../../bench/wtperf/wtperf.c:2322
As I understand it, the failure was injected while the async worker threads were being created. This sent the wiredtiger_open call into its error-handling path during setup, which calls __wt_connection_close, which in turn calls __wt_async_flush. The flush never completes because there is no async worker thread to process it.