WiredTiger / WT-2353

Failure to create async threads as part of a wiredtiger_open call will cause a hang

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: WT2.9.0, 3.2.10, 3.3.11
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      I've been doing some experimentation with failure injection.

      During one of my experiments I found that WiredTiger had hung following the random injection of a calloc failure.
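
      For context, a calloc failure of this kind can be injected with a small allocation wrapper. This is only a sketch: inject_calloc, fail_at_call and call_count are hypothetical names, not part of WiredTiger, and a deterministic "fail the Nth call" counter stands in for the random injection used in the experiment so that a failing run can be reproduced.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical injection state (assumption, not WiredTiger code):
 * fail the Nth allocation; 0 disables injection.
 */
static unsigned long fail_at_call;
static unsigned long call_count;

/*
 * inject_calloc --
 *     Wrapper around calloc that returns NULL on the configured call,
 *     simulating an out-of-memory condition at that point.
 */
static void *
inject_calloc(size_t number, size_t size)
{
    if (fail_at_call != 0 && ++call_count == fail_at_call)
        return (NULL);
    return (calloc(number, size));
}
```

      Setting fail_at_call to successive values walks the failure through each allocation site in turn, which is one way to reach error paths such as the one in the stack below.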

      Looking at the stack, I found the following:

      Thread 3 (Thread 0x7f62df85a700 (LWP 4986)):
      #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
      #1  0x0000000000445fd2 in __wt_cond_wait_signal (session=0x7f62e18649d0, cond=0x15ec080, usecs=100000, signalled=0x7f62df859e9f) at ../src/os_posix/os_mtx_cond.c:82
      #2  0x0000000000428d79 in __wt_cond_wait (session=0x7f62e18649d0, cond=0x15ec080, usecs=100000) at ../src/include/misc.i:18
      #3  0x000000000042a24d in __evict_server (arg=0x7f62e18649d0) at ../src/evict/evict_lru.c:241
      #4  0x00007f62e0bfc555 in start_thread (arg=0x7f62df85a700) at pthread_create.c:333
      #5  0x00007f62e00f9b9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
      
      Thread 2 (Thread 0x7f62df059700 (LWP 4987)):
      #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
      #1  0x0000000000445fd2 in __wt_cond_wait_signal (session=0x7f62e1864d10, cond=0x161d590, usecs=10000000, signalled=0x7f62df058ebf) at ../src/os_posix/os_mtx_cond.c:82
      #2  0x000000000041ec48 in __wt_cond_wait (session=0x7f62e1864d10, cond=0x161d590, usecs=10000000) at ../src/include/misc.i:18
      #3  0x000000000041f8e1 in __sweep_server (arg=0x7f62e1864d10) at ../src/conn/conn_sweep.c:272
      #4  0x00007f62e0bfc555 in start_thread (arg=0x7f62df059700) at pthread_create.c:333
      #5  0x00007f62e00f9b9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
      
      Thread 1 (Thread 0x7f62e1938740 (LWP 4985)):
      #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
      #1  0x0000000000445fd2 in __wt_cond_wait_signal (session=0x0, cond=0x161d600, usecs=100000, signalled=0x7ffd9a6979ff) at ../src/os_posix/os_mtx_cond.c:82
      #2  0x000000000048e28a in __wt_cond_wait (session=0x0, cond=0x161d600, usecs=100000) at ../src/include/misc.i:18
      #3  0x000000000048fa8e in __wt_async_flush (session=0x7f62e1864010) at ../src/async/async_api.c:533
      #4  0x000000000041c6c4 in __wt_connection_close (conn=0x15da370) at ../src/conn/conn_open.c:104
      #5  0x00000000004155fd in wiredtiger_open (home=0x52a67d "WT_TEST", event_handler=0x0,
          config=0x15d9b30 "create,cache_size=21G,checkpoint_sync=false,mmap=false,session_max=1024,lsm_manager=(worker_thread_max=6),create,cache_size=21G,checkpoint_sync=false,mmap=false,session_max=1024,lsm_manager=(worker_th"...,
          wt_connp=0x7ffd9a697ed0) at ../src/conn/conn_api.c:2092
      #6  0x000000000040ac7c in start_run (cfg=0x7ffd9a697ea0) at ../../../bench/wtperf/wtperf.c:1947
      #7  0x000000000040a851 in start_all_runs (cfg=0x7ffd9a697ea0) at ../../../bench/wtperf/wtperf.c:1858
      #8  0x000000000040bf90 in main (argc=5, argv=0x7ffd9a6981b8) at ../../../bench/wtperf/wtperf.c:2322
      

      As I understand it, the failure was injected while creating the async worker threads. This sent the wiredtiger_open call into its error-handling path during setup, which calls __wt_connection_close, which in turn calls __wt_async_flush. The flush never completes because there is no async thread to process it.
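
      A minimal sketch of this failure mode, assuming a simplified model of the async subsystem. Every name here (async_state, async_flush, async_worker, async_start_worker) is hypothetical; this is not WiredTiger's actual code, nor the fix that was applied. A flush that waits for a worker to signal completion can never return when no worker was started, so the sketch checks the worker count before waiting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the async subsystem state. */
struct async_state {
    pthread_mutex_t lock;
    pthread_cond_t flush_cond;
    int worker_count;    /* Workers that were actually started. */
    bool flush_pending;  /* A flush has been requested. */
    bool flush_complete; /* Set by a worker once the flush is processed. */
};

static void
async_state_init(struct async_state *as)
{
    pthread_mutex_init(&as->lock, NULL);
    pthread_cond_init(&as->flush_cond, NULL);
    as->worker_count = 0;
    as->flush_pending = false;
    as->flush_complete = false;
}

/* Hypothetical worker: wait for a flush request, then mark it complete. */
static void *
async_worker(void *arg)
{
    struct async_state *as = arg;

    pthread_mutex_lock(&as->lock);
    while (!as->flush_pending)
        pthread_cond_wait(&as->flush_cond, &as->lock);
    as->flush_complete = true;
    pthread_cond_broadcast(&as->flush_cond);
    pthread_mutex_unlock(&as->lock);
    return (NULL);
}

/* Start one worker; only count it if thread creation succeeded. */
static int
async_start_worker(struct async_state *as, pthread_t *tid)
{
    int ret;

    if ((ret = pthread_create(tid, NULL, async_worker, as)) != 0)
        return (ret);
    pthread_mutex_lock(&as->lock);
    ++as->worker_count;
    pthread_mutex_unlock(&as->lock);
    return (0);
}

/*
 * Hypothetical flush. Returns true if a worker processed the flush,
 * false if it was skipped because no worker exists to process it.
 */
static bool
async_flush(struct async_state *as)
{
    assert(as != NULL);

    pthread_mutex_lock(&as->lock);
    if (as->worker_count == 0) {
        /* Without this guard the wait below would never be satisfied. */
        pthread_mutex_unlock(&as->lock);
        return (false);
    }
    as->flush_pending = true;
    pthread_cond_broadcast(&as->flush_cond);
    while (!as->flush_complete)
        pthread_cond_wait(&as->flush_cond, &as->lock);
    pthread_mutex_unlock(&as->lock);
    return (true);
}
```

      With worker_count at zero, async_flush returns false immediately instead of waiting. Once async_start_worker succeeds, the same call hands the flush to the worker and returns true.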

            Assignee:
            backlog-server-execution [DO NOT USE] Backlog - Storage Execution Team
            Reporter:
            david.hows David Hows
            Votes:
            0
            Watchers:
            3
