Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-1261

Test format stress failure - hazard pointer assertion

    • Type: Icon: Task Task
    • Resolution: Done
    • None
    • Affects Version/s: None
    • Component/s: None

      test/format failed on tinderbox a couple days ago.
      http://mjc.homeunix.org:8180/job/wiredtiger-test-format-stress/5419/

      This console says:

      [1412512381:36088][42905:00577afe927f0000], t, file:wt, eviction-worker: discarded page has hazard pointer: (0x7f92dc209810: ../src/btree/col_srch.c, line 95)
      [1412512381:36142][42905:00577afe927f0000], t, file:wt, eviction-worker: ../src/btree/bt_discard.c, 67: hp == NULL
      [1412512381:36148][42905:00577afe927f0000], t, file:wt, eviction-worker: aborting WiredTiger library
      

      Although the hp->page is listed in the output, in the core file it is already NULL:

      (gdb) p hp
      $1 = (WT_HAZARD *) 0x2788130
      (gdb) p *hp
      $2 = {page = 0x0, file = 0x67364d "../src/btree/col_srch.c", line = 95}
      (gdb) p *(WT_PAGE*)0x7f92dc209810
      $3 = {u = {intl = {recno = 28400, parent_ref = 0x7f92dc209858, __index = 0x7f92dc040a10}, row = {d = 0x6ef0, ins = 0x7f92dc209858, upd = 0x7f92dc040a10, 
            entries = 91}, col_fix = {recno = 28400, bitf = 0x7f92dc209858 "(", entries = 3691252240}, col_var = {recno = 28400, d = 0x7f92dc209858, 
            repeats = 0x7f92dc040a10, nrepeats = 91, entries = 28231}}, dsk = 0x7f92dc252bb0, modify = 0x0, read_gen = 6334137, memory_footprint = 1180930, 
        type = 4 '\004', flags_atomic = 2 '\002'}
      

      Here are the stacks of all interesting threads (4 of 8). Thread 1 is the one that took the assertion. Thread 5 shows the file/line in the hazard pointer.

      (gdb) thread apply all bt
      Thread 5 (Thread 0x7f92ff7a7700 (LWP 43594)):
      #0  0x000000339e0daf87 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
      WT-1  0x000000000042e67d in __wt_yield () at ../src/os_posix/os_yield.c:17
      WT-2  0x00000000004c0f0e in __wt_page_in_func (session=0x2792e10, ref=0x7f92bc1d2770, flags=0, file=0x67364d "../src/btree/col_srch.c", line=95)
          at ../src/btree/bt_page.c:152
      WT-3  0x00000000004c5eb0 in __wt_page_swap_func (session=0x2792e10, held=0x27cc4e8, want=0x7f92bc1d2770, flags=0, file=0x67364d "../src/btree/col_srch.c", line=95)
          at ../src/include/btree.i:1008
      WT-4  0x00000000004c66c0 in __wt_col_search (session=0x2792e10, recno=56723, leaf=0x0, cbt=0x7f92dc041780) at ../src/btree/col_srch.c:95
      WT-5  0x00000000004b2f67 in __cursor_col_search (session=0x2792e10, cbt=0x7f92dc041780) at ../src/btree/bt_cursor.c:223
      WT-6  0x00000000004b3292 in __wt_btcur_search (cbt=0x7f92dc041780) at ../src/btree/bt_cursor.c:308
      WT-7  0x000000000048927f in __curfile_search (cursor=0x7f92dc041780) at ../src/cursor/cur_file.c:165
      WT-8  0x00000000004088d9 in col_remove (cursor=0x7f92dc041780, key=0x7f92ff7a6e60, keyno=56723, notfoundp=0x7f92ff7a6e14) at ../../../test/format/ops.c:1031
      WT-9  0x0000000000406edd in ops (arg=0x27cce80) at ../../../test/format/ops.c:345
      WT-10 0x000000339e807c53 in start_thread (arg=0x7f92ff7a7700) at pthread_create.c:308
      WT-11 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
      
      Thread 4 (Thread 0x7f92fd1a5700 (LWP 43593)):
      #0  0x0000000000473bb0 in __rec_col_var (session=0x2793090, r=0x7f92b421f660, page=0x7f92b427fa00, salvage=0x0) at ../src/btree/rec_write.c:3581
      WT-1  0x000000000046d6e9 in __wt_rec_write (session=0x2793090, ref=0x27cc720, salvage=0x0, flags=5) at ../src/btree/rec_write.c:405
      WT-2  0x0000000000467232 in __rec_review (session=0x2793090, ref=0x27cc720, exclusive=0, top=1, istree=0x7f92fd1a4a34) at ../src/btree/rec_evict.c:390
      WT-3  0x0000000000466737 in __wt_rec_evict (session=0x2793090, ref=0x27cc720, exclusive=0) at ../src/btree/rec_evict.c:54
      WT-4  0x0000000000457f6d in __wt_evict_page (session=0x2793090, ref=0x27cc720) at ../src/btree/bt_evict.c:618
      WT-5  0x0000000000459096 in __wt_evict_lru_page (session=0x2793090, is_app=1) at ../src/btree/bt_evict.c:1220
      WT-6  0x00000000004b271c in __wt_cache_full_check (session=0x2793090) at ../src/include/cache.i:133
      WT-7  0x00000000004b28fb in __cursor_enter (session=0x2793090) at ../src/include/cursor.i:62
      WT-8  0x00000000004b29d4 in __curfile_enter (cbt=0x7f92b44a5c10) at ../src/include/cursor.i:99
      WT-9  0x00000000004b2b01 in __cursor_func_init (cbt=0x7f92b44a5c10, reenter=1) at ../src/include/cursor.i:145
      WT-10 0x00000000004b3248 in __wt_btcur_search (cbt=0x7f92b44a5c10) at ../src/btree/bt_cursor.c:306
      WT-11 0x000000000048927f in __curfile_search (cursor=0x7f92b44a5c10) at ../src/cursor/cur_file.c:165
      WT-12 0x00000000004075de in read_row (cursor=0x7f92b44a5c10, key=0x7f92fd1a4e60, keyno=70321) at ../../../test/format/ops.c:542
      WT-13 0x000000000040714e in ops (arg=0x27cce30) at ../../../test/format/ops.c:416
      WT-14 0x000000339e807c53 in start_thread (arg=0x7f92fd1a5700) at pthread_create.c:308
      WT-15 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
      
      Thread 2 (Thread 0x7f92fdfa4700 (LWP 52680)):
      #0  0x000000000046ba4a in __wt_cell_unpack_safe (cell=0x7f92a416685c, unpack=0x7f92fdfa3a00, end=0x0) at ../src/include/cell.i:599
      WT-1  0x000000000046bd87 in __wt_cell_unpack (cell=0x7f92a416685c, unpack=0x7f92fdfa3a00) at ../src/include/cell.i:709
      WT-2  0x0000000000470343 in __rec_split_raw_worker (session=0x2792410, r=0x7f92a412ad40, no_more_rows=1) at ../src/btree/rec_write.c:2118
      WT-3  0x000000000047138c in __rec_split_finish_raw (session=0x2792410, r=0x7f92a412ad40) at ../src/btree/rec_write.c:2548
      WT-4  0x00000000004713dd in __rec_split_finish (session=0x2792410, r=0x7f92a412ad40) at ../src/btree/rec_write.c:2563
      WT-5  0x0000000000474345 in __rec_col_var (session=0x2792410, r=0x7f92a412ad40, page=0x7f92b41e0000, salvage=0x0) at ../src/btree/rec_write.c:3878
      WT-6  0x000000000046d6e9 in __wt_rec_write (session=0x2792410, ref=0x7f92bc1d0830, salvage=0x0, flags=5) at ../src/btree/rec_write.c:405
      WT-7  0x0000000000467232 in __rec_review (session=0x2792410, ref=0x7f92bc1d0830, exclusive=0, top=1, istree=0x7f92fdfa3dc4) at ../src/btree/rec_evict.c:390
      WT-8  0x0000000000466737 in __wt_rec_evict (session=0x2792410, ref=0x7f92bc1d0830, exclusive=0) at ../src/btree/rec_evict.c:54
      WT-9  0x0000000000457f6d in __wt_evict_page (session=0x2792410, ref=0x7f92bc1d0830) at ../src/btree/bt_evict.c:618
      WT-10 0x0000000000459096 in __wt_evict_lru_page (session=0x2792410, is_app=1) at ../src/btree/bt_evict.c:1220
      WT-11 0x0000000000458157 in __evict_lru_pages (session=0x2792410, is_app=1) at ../src/btree/bt_evict.c:703
      WT-12 0x000000000045779f in __evict_worker (arg=0x27b8a50) at ../src/btree/bt_evict.c:365
      WT-13 0x000000339e807c53 in start_thread (arg=0x7f92fdfa4700) at pthread_create.c:308
      WT-14 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
      
      Thread 1 (Thread 0x7f92fe7a5700 (LWP 52679)):
      #0  0x000000339e0359e9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
      WT-1  0x000000339e0370f8 in __GI_abort () at abort.c:90
      WT-2  0x00000000004a49bb in __wt_abort (session=0x2792190) at ../src/os_posix/os_abort.c:24
      WT-3  0x0000000000443dff in __wt_assert (session=0x2792190, error=0, file_name=0x6724f8 "../src/btree/bt_discard.c", line_number=67, fmt=0x6724b7 "%s")
          at ../src/support/err.c:452
      WT-4  0x00000000004b8e0e in __wt_page_out (session=0x2792190, pagep=0x7f92bc1d2770) at ../src/btree/bt_discard.c:67
      WT-5  0x00000000004b8cae in __wt_ref_out (session=0x2792190, ref=0x7f92bc1d2770) at ../src/btree/bt_discard.c:32
      WT-6  0x0000000000466ae9 in __rec_page_clean_update (session=0x2792190, ref=0x7f92bc1d2770) at ../src/btree/rec_evict.c:126
      WT-7  0x00000000004668a9 in __wt_rec_evict (session=0x2792190, ref=0x7f92bc1d2770, exclusive=0) at ../src/btree/rec_evict.c:80
      WT-8  0x0000000000457f6d in __wt_evict_page (session=0x2792190, ref=0x7f92bc1d2770) at ../src/btree/bt_evict.c:618
      WT-9  0x0000000000459096 in __wt_evict_lru_page (session=0x2792190, is_app=1) at ../src/btree/bt_evict.c:1220
      WT-10 0x0000000000458157 in __evict_lru_pages (session=0x2792190, is_app=1) at ../src/btree/bt_evict.c:703
      WT-11 0x000000000045779f in __evict_worker (arg=0x27b8a30) at ../src/btree/bt_evict.c:365
      WT-12 0x000000339e807c53 in start_thread (arg=0x7f92fe7a5700) at pthread_create.c:308
      WT-13 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
      

      So we have a race with page in/out and hazard pointers. Thread 5 is trying to set the hazard pointer in *wt_hazard_set. After setting the page, file and line, we do a WT_FULL_BARRIER. Then it decides that the page is not available, setting hp->page = NULL but without publishing that. At the same time, thread 1, an eviction thread, is doing its last-minute checks on page out. It is in *wt_page_hazard_check and happens to be walking that session's hazard pointers and finds the same hp->page.

      In __wt_hazard_set we say:

                       * We don't bother publishing this update: the worst case is we
                       * prevent some random page from being evicted.
      

      Perhaps we do need to publish this update. Although it seems like that just narrows a window anyway. We set the page and then could be descheduled before deciding it's busy and clearing the page.

            Assignee:
            Unassigned Unassigned
            Reporter:
            sue.loverso@mongodb.com Susan LoVerso
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: