Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-1915

Segfault running wtperf LSM workload

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.6.0
    • Labels:
      None
    • # Replies:
      6
    • Last comment by Customer:
      true

      Description

      There was a Jenkins test failure running the medium-multi-lsm wtperf workload:

      http://build.wiredtiger.com:8080/job/wiredtiger-perf-med-multi-lsm/941/console

      The failure doesn't reproduce immediately. The failure is:

      ../../../bench/wtperf/runners/wtperf_run.sh: line 147: 16623 Segmentation fault      (core dumped) LD_PRELOAD=/usr/lib64/libjemalloc.so.1 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib ./wtperf -O $wttest
      

      The stack trace is:

      #0  0x0000000000419084 in __wt_cursor_init ()
      #1  0x000000000048e9ce in __wt_curfile_create ()
      #2  0x000000000048ebc9 in __wt_curfile_open ()
      #3  0x000000000044c568 in __wt_open_cursor ()
      #4  0x000000000049c04e in __clsm_open_cursors ()
      #5  0x00000000004a1b9f in __wt_clsm_init_merge ()
      #6  0x00000000004a2633 in __wt_lsm_merge ()
      #7  0x0000000000427877 in __lsm_worker ()
      #8  0x00007f82125f8f18 in start_thread () from /lib64/libpthread.so.0
      #9  0x00007f821232eb2d in clone () from /lib64/libc.so.6
      

      There are a number of other threads active concurrently:

      Thread 14 (Thread 0x7f820bffb700 (LWP 16634)):
      #0  0x00007f82123292c7 in ftruncate64 () from /lib64/libc.so.6
      #1  0x000000000042b4be in __wt_ftruncate ()
      #2  0x000000000045e831 in __wt_block_truncate ()
      #3  0x00000000004ccdc8 in __wt_block_checkpoint_unload ()
      #4  0x00000000004b6403 in __bm_checkpoint_unload ()
      #5  0x0000000000461972 in __wt_btree_close () at ../src/btree/bt_handle.c:147
      #6  0x00000000004848d0 in __wt_conn_btree_sync_and_close ()
      #7  0x000000000044de0c in __wt_session_release_btree ()
      #8  0x000000000048ce28 in __curfile_close ()
      #9  0x00000000004a2ac5 in __wt_lsm_merge ()
      #10 0x0000000000427877 in __lsm_worker ()
      #11 0x00007f82125f8f18 in start_thread () from /lib64/libpthread.so.0
      #12 0x00007f821232eb2d in clone () from /lib64/libc.so.6
       
      Thread 13 (Thread 0x7f820effa700 (LWP 16629)):
      #0  0x00007f8212322fe7 in unlink () from /lib64/libc.so.6
      #1  0x00007f82122b1259 in remove () from /lib64/libc.so.6
      #2  0x000000000042c7f6 in __wt_remove ()
      #3  0x00000000004a3fa8 in __lsm_drop_file ()
      #4  0x00000000004a4b1d in __wt_lsm_free_chunks ()
      #5  0x0000000000427959 in __lsm_worker ()
      #6  0x00007f82125f8f18 in start_thread () from /lib64/libpthread.so.0
      #7  0x00007f821232eb2d in clone () from /lib64/libc.so.6
       
       
      Thread 11 (Thread 0x7f820dfff700 (LWP 16630)):
      #0  0x00007f82125ff265 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f82125fadc1 in _L_lock_816 () from /lib64/libpthread.so.0
      #2  0x00007f82125facc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x000000000044e5b7 in __wt_session_get_btree ()
      #4  0x000000000044e732 in __wt_session_get_btree_ckpt ()
      #5  0x000000000048ecbd in __wt_curfile_open ()
      #6  0x000000000044c568 in __wt_open_cursor ()
      #7  0x00000000004b8f5b in __wt_bloom_hash_get ()
      #8  0x00000000004b8fc8 in __wt_bloom_get ()
      #9  0x00000000004a2bad in __wt_lsm_merge ()
      #10 0x0000000000427877 in __lsm_worker ()
      #11 0x00007f82125f8f18 in start_thread () from /lib64/libpthread.so.0
      #12 0x00007f821232eb2d in clone () from /lib64/libc.so.6
       
      Thread 8 (Thread 0x7f820d7fe700 (LWP 16631)):
      #0  0x00007f82125ff265 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f82125fadc1 in _L_lock_816 () from /lib64/libpthread.so.0
      #2  0x00007f82125facc7 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x000000000044b04f in __session_create ()
      #4  0x00000000004b8cdb in __wt_bloom_finalize ()
      #5  0x00000000004a456f in __wt_lsm_work_bloom ()
      #6  0x0000000000427995 in __lsm_worker ()
      #7  0x00007f82125f8f18 in start_thread () from /lib64/libpthread.so.0
      #8  0x00007f821232eb2d in clone () from /lib64/libc.so.6
       
      Thread 7 (Thread 0x7f820c7fc700 (LWP 16633)):
      #0  0x00007f8212f248df in bitmap_sfu (arena=0x7f8211c51ac0, tbin=0x7f8202806088, binind=3, prof_accumbytes=<value optimized out>)
          at include/jemalloc/internal/bitmap.h:137
      #1  arena_run_reg_alloc (arena=0x7f8211c51ac0, tbin=0x7f8202806088, binind=3, prof_accumbytes=<value optimized out>) at src/arena.c:325
      #2  arena_tcache_fill_small (arena=0x7f8211c51ac0, tbin=0x7f8202806088, binind=3, prof_accumbytes=<value optimized out>) at src/arena.c:1348
      #3  0x00007f8212f3d6ff in tcache_alloc_small_hard (tcache=<value optimized out>, tbin=0x7f8202806088, binind=<value optimized out>) at src/tcache.c:72
      #4  0x00007f8212f1d85a in tcache_alloc_small (num=<value optimized out>, size=<value optimized out>) at include/jemalloc/internal/tcache.h:302
      #5  arena_malloc (num=<value optimized out>, size=<value optimized out>) at include/jemalloc/internal/arena.h:916
      #6  icallocx (num=<value optimized out>, size=<value optimized out>) at include/jemalloc/internal/jemalloc_internal.h:800
      #7  icalloc (num=<value optimized out>, size=<value optimized out>) at include/jemalloc/internal/jemalloc_internal.h:809
      #8  calloc (num=<value optimized out>, size=<value optimized out>) at src/jemalloc.c:1079
      #9  0x0000000000429d60 in __wt_calloc ()
      #10 0x0000000000465eb3 in __wt_page_alloc ()
      #11 0x0000000000465fec in __wt_page_inmem ()
      #12 0x0000000000468568 in __wt_cache_read ()
      #13 0x0000000000465933 in __wt_page_in_func ()
      #14 0x000000000047866d in __wt_tree_walk ()
      #15 0x00000000004b9a86 in __wt_btcur_next ()
      #16 0x000000000048c9fa in __curfile_next ()
      #17 0x000000000049db36 in __clsm_next ()
      #18 0x00000000004a4542 in __wt_lsm_work_bloom ()
      #19 0x0000000000427995 in __lsm_worker ()
      #20 0x00007f82125f8f18 in start_thread () from /lib64/libpthread.so.0
      #21 0x00007f821232eb2d in clone () from /lib64/libc.so.6
      

      The failure happened towards the end of a load phase. The final three lines of WT_TEST/test.stat:

      579597 populate inserts (44372348 of 50000000) in 5 secs (180 total secs)
      379201 populate inserts (44751549 of 50000000) in 5 secs (185 total secs)
      984072 populate inserts (45735621 of 50000000) in 5 secs (190 total secs)
      

        Activity

        Hide
        michael.cahill Michael Cahill added a comment -

        A possible culprit here is recent changes to the sweep code and discard of obsolete LSM handles.

        e.g., maybe the dhandle we're trying to open a cursor on has been deleted before it was removed from a session cache?

        Show
        michael.cahill Michael Cahill added a comment - A possible culprit here is recent changes to the sweep code and discard of obsolete LSM handles. e.g., maybe the dhandle we're trying to open a cursor on has been deleted before it was removed from a session cache?
        Hide
        alexander.gorrod Alexander Gorrod added a comment -

        This reproduced after ~6 runs. The stack trace with symbols is:

        (gdb) where
        #0  0x00000000004212dd in __wt_cursor_init (cursor=0x7f1cdc3673c0, 
            uri=0x7f1ccc1a5e20 "file:test00001-000373.lsm", owner=0x7f1cdc365ca0, 
            cfg=0x7f1cedbdac60, cursorp=0x7f1cdc284eb8) at ../src/cursor/cur_std.c:611
        #1  0x00000000004b400f in __wt_curfile_create (session=0x1b39c10, 
            owner=0x7f1cdc365ca0, cfg=0x7f1cedbdac60, bulk=0, bitmap=0, 
            cursorp=0x7f1cdc284eb8) at ../src/cursor/cur_file.c:464
        #2  0x00000000004b434e in __wt_curfile_open (session=0x1b39c10, 
            uri=0x7f1cd4359380 "file:test00001-000373.lsm", owner=0x7f1cdc365ca0, 
            cfg=0x7f1cedbdac60, cursorp=0x7f1cdc284eb8) at ../src/cursor/cur_file.c:526
        #3  0x000000000045bb43 in __wt_open_cursor (session=0x1b39c10, 
            uri=0x7f1cd4359380 "file:test00001-000373.lsm", owner=0x7f1cdc365ca0, 
            cfg=0x7f1cedbdac60, cursorp=0x7f1cdc284eb8)
            at ../src/session/session_api.c:274
        #4  0x00000000004c71a7 in __clsm_open_cursors (clsm=0x7f1cdc365ca0, update=0, 
            start_chunk=9, start_id=357) at ../src/lsm/lsm_cursor.c:581
        #5  0x00000000004c7787 in __wt_clsm_init_merge (cursor=0x7f1cdc365ca0, 
            start_chunk=9, start_id=357, nchunks=17) at ../src/lsm/lsm_cursor.c:693
        #6  0x00000000004cbdba in __wt_lsm_merge (session=0x1b39c10, 
            lsm_tree=0x1c3f630, id=2) at ../src/lsm/lsm_merge.c:323
        #7  0x00000000004348f9 in __lsm_worker (arg=0x1b34a38)
            at ../src/lsm/lsm_worker.c:138
        

        The cursor has NULL key and value format fields:

        (gdb) p *cursor
        $1 = {session = 0x1b39c10, uri = 0x0, key_format = 0x0, value_format = 0x0,
          internal_uri = 0x7f1ccc1a5e20 "file:test00001-000373.lsm", flags = 0}
        

        Looks as though these have been copied from the (recently opened) btree:

        (gdb) p session->dhandle
        $3 = (WT_DATA_HANDLE *) 0x7f1ccc925570
        (gdb) p session->dhandle->name
        $4 = 0x7f1ccc1a5e20 "file:test00001-000373.lsm"
        (gdb) p (WT_BTREE*)session->dhandle->handle
        $5 = (WT_BTREE *) 0x7f1ccc751120
        (gdb) p $5->key_format
        $6 = 0x0
        (gdb) p $5->value_format
        $7 = 0x0
        (gdb) p /x session->dhandle->flags
        $8 = 0x20
        

        Show
        alexander.gorrod Alexander Gorrod added a comment - This reproduced after ~6 runs. The stack trace with symbols is: (gdb) where #0 0x00000000004212dd in __wt_cursor_init (cursor=0x7f1cdc3673c0, uri=0x7f1ccc1a5e20 "file:test00001-000373.lsm", owner=0x7f1cdc365ca0, cfg=0x7f1cedbdac60, cursorp=0x7f1cdc284eb8) at ../src/cursor/cur_std.c:611 #1 0x00000000004b400f in __wt_curfile_create (session=0x1b39c10, owner=0x7f1cdc365ca0, cfg=0x7f1cedbdac60, bulk=0, bitmap=0, cursorp=0x7f1cdc284eb8) at ../src/cursor/cur_file.c:464 #2 0x00000000004b434e in __wt_curfile_open (session=0x1b39c10, uri=0x7f1cd4359380 "file:test00001-000373.lsm", owner=0x7f1cdc365ca0, cfg=0x7f1cedbdac60, cursorp=0x7f1cdc284eb8) at ../src/cursor/cur_file.c:526 #3 0x000000000045bb43 in __wt_open_cursor (session=0x1b39c10, uri=0x7f1cd4359380 "file:test00001-000373.lsm", owner=0x7f1cdc365ca0, cfg=0x7f1cedbdac60, cursorp=0x7f1cdc284eb8) at ../src/session/session_api.c:274 #4 0x00000000004c71a7 in __clsm_open_cursors (clsm=0x7f1cdc365ca0, update=0, start_chunk=9, start_id=357) at ../src/lsm/lsm_cursor.c:581 #5 0x00000000004c7787 in __wt_clsm_init_merge (cursor=0x7f1cdc365ca0, start_chunk=9, start_id=357, nchunks=17) at ../src/lsm/lsm_cursor.c:693 #6 0x00000000004cbdba in __wt_lsm_merge (session=0x1b39c10, lsm_tree=0x1c3f630, id=2) at ../src/lsm/lsm_merge.c:323 #7 0x00000000004348f9 in __lsm_worker (arg=0x1b34a38) at ../src/lsm/lsm_worker.c:138 The cursor has NULL key and value format fields: (gdb) p *cursor $1 = {session = 0x1b39c10, uri = 0x0, key_format = 0x0, value_format = 0x0, internal_uri = 0x7f1ccc1a5e20 "file:test00001-000373.lsm", flags = 0} Looks as though these have been copied from the (recently opened) btree: (gdb) p session->dhandle $3 = (WT_DATA_HANDLE *) 0x7f1ccc925570 (gdb) p session->dhandle->name $4 = 0x7f1ccc1a5e20 "file:test00001-000373.lsm" (gdb) p (WT_BTREE*)session->dhandle->handle $5 = (WT_BTREE *) 0x7f1ccc751120 (gdb) p $5->key_format $6 = 0x0 (gdb) p $5->value_format $7 = 0x0 (gdb) p /x session->dhandle->flags $8 = 0x20
        Hide
        alexander.gorrod Alexander Gorrod added a comment -

        A possible culprit here is recent changes to the sweep code and discard of obsolete LSM handles.

        I think it's more likely to be the handle open changes, the particular chunk that is being opened in a strange state doesn't have WT_DHANDLE_DEAD set on it, and is one of the most recent chunks in the tree:

        -rw-rw-r-- 1 alexg alexg     135168 May  4 07:04 test00001-000372.bf
        -rw-rw-r-- 1 alexg alexg    7811072 May  4 07:04 test00001-000372.lsm
        -rw-rw-r-- 1 alexg alexg     139264 May  4 07:04 test00001-000373.bf
        -rw-rw-r-- 1 alexg alexg    7929856 May  4 07:04 test00001-000373.lsm
        -rw-rw-r-- 1 alexg alexg     221184 May  4 07:04 test00001-000374.bf
        -rw-rw-r-- 1 alexg alexg   13348864 May  4 07:04 test00001-000374.lsm
        -rw-rw-r-- 1 alexg alexg    9801728 May  4 07:04 test00001-000375.lsm
        -rw-rw-r-- 1 alexg alexg       4096 May  4 07:04 test00001-000376.lsm
        

        Show
        alexander.gorrod Alexander Gorrod added a comment - A possible culprit here is recent changes to the sweep code and discard of obsolete LSM handles. I think it's more likely to be the handle open changes, the particular chunk that is being opened in a strange state doesn't have WT_DHANDLE_DEAD set on it, and is one of the most recent chunks in the tree: -rw-rw-r-- 1 alexg alexg 135168 May 4 07:04 test00001-000372.bf -rw-rw-r-- 1 alexg alexg 7811072 May 4 07:04 test00001-000372.lsm -rw-rw-r-- 1 alexg alexg 139264 May 4 07:04 test00001-000373.bf -rw-rw-r-- 1 alexg alexg 7929856 May 4 07:04 test00001-000373.lsm -rw-rw-r-- 1 alexg alexg 221184 May 4 07:04 test00001-000374.bf -rw-rw-r-- 1 alexg alexg 13348864 May 4 07:04 test00001-000374.lsm -rw-rw-r-- 1 alexg alexg 9801728 May 4 07:04 test00001-000375.lsm -rw-rw-r-- 1 alexg alexg 4096 May 4 07:04 test00001-000376.lsm
        Hide
        xgen-internal-githook Githook User added a comment -

        Author:

        {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

        Message: Fix a bug in handle close, where we could fail to clear WT_DHANDLE_OPEN.

        Specifically if closing a checkpoint handle. This manifested by a
        segfault opening a handle for a merge in LSM.

        Refs WT-1915
        Branch: develop
        https://github.com/wiredtiger/wiredtiger/commit/48c998f4b7ebd54e44c1f0c0db22b9fc4c06cfd7

        Show
        xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'} Message: Fix a bug in handle close, where we could fail to clear WT_DHANDLE_OPEN. Specifically if closing a checkpoint handle. This manifested by a segfault opening a handle for a merge in LSM. Refs WT-1915 Branch: develop https://github.com/wiredtiger/wiredtiger/commit/48c998f4b7ebd54e44c1f0c0db22b9fc4c06cfd7
        Hide
        michael.cahill Michael Cahill added a comment -
        Show
        michael.cahill Michael Cahill added a comment - Fixed by merge of https://github.com/wiredtiger/wiredtiger/pull/1936
        Hide
        michael.cahill Michael Cahill added a comment - - edited

        Sue LoVerso, this was the issue I was trying to remember when we were talking through WT-2164 today. On review, I don't see anything here that explains what you're seeing.

        Alexander Gorrod, just fyi: Sue LoVerso saw a checkpoint handle with WT_DHANDLE_OPEN set, we were evicting a page from it, but it seemed as if the page's disk image had been corrupted (either zeroed out or with a suspicious memsize).

        Show
        michael.cahill Michael Cahill added a comment - - edited Sue LoVerso , this was the issue I was trying to remember when we were talking through WT-2164 today. On review, I don't see anything here that explains what you're seeing. Alexander Gorrod , just fyi: Sue LoVerso saw a checkpoint handle with WT_DHANDLE_OPEN set, we were evicting a page from it, but it seemed as if the page's disk image had been corrupted (either zeroed out or with a suspicious memsize ).

          People

          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:
              Days since reply:
              1 year, 32 weeks ago
              Date of 1st Reply: