WiredTiger / WT-628

Deadlock when cache is full

    Details

    • Type: Task
    • Status: Closed
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: WT2.0
    • Component/s: None
    • Labels: None

      Description

      When running the 'small' configuration of the leveldb benchmark, I am seeing a hang caused by a deadlock. To reproduce:

       DYLD_LIBRARY_PATH=../wiredtiger/build_posix/.libs:../wiredtiger/build_posix/ext/compressors/snappy/.libs/ ./db_bench_wiredtiger --cache_size=6537216 --threads=1 --benchmarks=fill100K

      There is only one application thread. The LSM worker threads are blocked on lsm_tree->rwlock. Here are the stacks of the relevant threads:

      (gdb) thread apply all bt
       
      Thread 7 (process 48237):
      #0  0x00007fff9477a0fa in __psynch_cvwait ()
      #1  0x00007fff96b3cfe9 in _pthread_cond_wait ()
      #2  0x000000010ca68ed9 in __wt_cond_wait (session=0x7fdc43005880, cond=0x7fdc42800840, usecs=10000) at os_mtx.c:75
      #3  0x000000010c9e0581 in __wt_cache_full_check (session=0x7fdc43005880, onepass=1) at cache.i:87
      #4  0x000000010c9e00e0 in __cursor_leave (cbt=0x7fdc414048c0) at cursor.i:86
      #5  0x000000010c9e353a in __wt_btcur_close (cbt=0x7fdc414048c0) at bt_cursor.c:742
      #6  0x000000010ca40b6e in __curfile_close (cursor=0x7fdc414048c0) at cur_file.c:298
      #7  0x000000010ca54323 in __clsm_close_cursors (clsm=0x7fdc41406100, update=1, skip_chunks=0) at lsm_cursor.c:100
      #8  0x000000010ca546e9 in __clsm_open_cursors (clsm=0x7fdc41406100, update=1, start_chunk=0, start_id=0) at lsm_cursor.c:182
      #9  0x000000010ca55d2c in __clsm_enter (clsm=0x7fdc41406100, update=1) at lsm_cursor.c:48
      #10 0x000000010ca576d9 in __clsm_insert (cursor=0x7fdc41406100) at lsm_cursor.c:945
      #11 0x000000010c950ad7 in DoWrite (this=0x7fff532b4870, thread=0x7fdc43014400, seq=false) at db_bench_wiredtiger.cc:917
      #12 0x000000010c951002 in WriteRandom (this=0x7fff532b4870, thread=0x7fdc43014400) at db_bench_wiredtiger.cc:868
      #13 0x000000010c955769 in leveldb::Benchmark::ThreadBody (v=0x7fdc42800200) at db_bench_wiredtiger.cc:661
      #14 0x000000010c978772 in leveldb::(anonymous namespace)::StartThreadWrapper () at stl_vector.h:271
      #15 0x00007fff96b387a2 in _pthread_start ()
      #16 0x00007fff96b251e1 in thread_start ()
       
      Thread 6 (process 48237):
      #0  0x00007fff9477a1ae in __psynch_rw_wrlock ()
      #1  0x00007fff96b3eea6 in pthread_rwlock_wrlock ()
      #2  0x000000010ca69741 in __wt_writelock (session=0x7fdc43005aa0, rwlock=0x7fdc42803360) at os_mtx.c:239
      #3  0x000000010ca611b6 in __wt_lsm_checkpoint_worker (arg=0x7fdc4302d600) at lsm_worker.c:291
      #4  0x00007fff96b387a2 in _pthread_start ()
      #5  0x00007fff96b251e1 in thread_start ()
       
      Thread 5 (process 48237):
      #0  0x00007fff9477a1ae in __psynch_rw_wrlock ()
      #1  0x00007fff96b3eea6 in pthread_rwlock_wrlock ()
      #2  0x000000010ca69741 in __wt_writelock (session=0x7fdc43005cc0, rwlock=0x7fdc42803360) at os_mtx.c:239
      #3  0x000000010ca59b9c in __wt_lsm_merge (session=0x7fdc43005cc0, lsm_tree=0x7fdc4302d600, id=0, stalls=1) at lsm_merge.c:99
      #4  0x000000010ca608db in __wt_lsm_merge_worker (vargs=0x7fdc428038b0) at lsm_worker.c:87
      #5  0x00007fff96b387a2 in _pthread_start ()
      #6  0x00007fff96b251e1 in thread_start ()
       
      Thread 2 (process 48237):
      #0  0x00007fff9477a0fa in __psynch_cvwait ()
      #1  0x00007fff96b3cfe9 in _pthread_cond_wait ()
      #2  0x000000010ca68ed9 in __wt_cond_wait (session=0x7fdc43005220, cond=0x7fdc428007c0, usecs=100000) at os_mtx.c:75
      #3  0x000000010c9e90c0 in __wt_cache_evict_server (arg=0x7fdc43005220) at bt_evict.c:167
      #4  0x00007fff96b387a2 in _pthread_start ()
      #5  0x00007fff96b251e1 in thread_start ()
      (gdb)

      Thread 7 takes the lsm_tree->rwlock for reading in __clsm_close_cursors. Both WT_EVICT_STUCK and WT_EVICT_NO_PROGRESS are set. The application thread is stuck in __wt_cache_full_check even though onepass is 1, because the call to __wt_evict_lru_page continually returns WT_NOTFOUND.

      So thread 7 is stuck in the retry loop in __wt_cache_full_check while holding the read lock, because __wt_evict_lru_page never returns anything other than WT_NOTFOUND. The LSM merge thread (thread 5) cannot make progress because it cannot get the write lock. The eviction thread (thread 2) never finds anything it can evict.

      This is different from the no-progress report Alex mentions in WT-573.

              People

              • Assignee: Michael Cahill (michael.cahill)
              • Reporter: Sue LoVerso (sue.loverso)