Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-7284

Checkpoint timeout on an eviction gen while one eviction is walking the history store tree

    XMLWordPrintable

    Details

    • Type: Build Failure
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      8
    • Sprint:
      Storage - Ra 2021-03-22, Storage - Ra 2021-04-05

      Description

      Here's the BFG link I did some digging:
      https://jira.mongodb.org/browse/BFG-778999

      The error shows checkpoint times out on an eviction gen:

      WiredTiger error
      error: 0
      message: [1615372118:172][24168:0x7f4b7933b700], file:index-35-8537597942191832685.wt, WT_SESSION.checkpoint: __wt_gen_drain, 169: minutes < 4
      

      By looking at the core dump, we are in the checkpoint stack and waiting on the following eviction to finish:

      #0  0x00007f4b9021bfb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
      #1  0x00007f4b9021d921 in abort () from /lib/x86_64-linux-gnu/libc.so.6
      #2  0x000055ae6cb37488 in __wt_abort (session=<optimized out>) at src/third_party/wiredtiger/src/os_common/os_abort.c:30
      #3  0x000055ae6cc53e3e in __wt_gen_drain (session=0x7f4b81325f20, which=2, generation=45) at src/third_party/wiredtiger/src/support/generation.c:169
      #4  0x000055ae6cc535c5 in __wt_gen_next_drain (session=0x7f4b81325f20, which=2) at src/third_party/wiredtiger/src/support/generation.c:97
      #5  0x000055ae6c888941 in __wt_sync_file (session=<optimized out>, syncop=<optimized out>) at src/third_party/wiredtiger/src/btree/bt_sync.c:539
      #6  0x000055ae6cd376df in __checkpoint_tree (session=0x7f4b81325f20, is_checkpoint=<optimized out>, cfg=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1712
      #7  0x000055ae6cd303e1 in __checkpoint_tree_helper (session=<optimized out>, cfg=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1820
      #8  __checkpoint_apply_to_dhandles (session=<optimized out>, cfg=<optimized out>, op=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:197
      #9  __txn_checkpoint (session=<optimized out>, cfg=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:880
      #10 __txn_checkpoint_wrapper (session=0x7f4b81325f20, cfg=0x7f4b79339300) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1093
      #11 0x000055ae6cd2ae06 in __wt_txn_checkpoint (session=0x7f4b81325f20, cfg=0x7f4b79339300, waiting=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1147
      #12 0x000055ae6cc2eccb in __session_checkpoint (wt_session=0x7f4b81325f20, config=0x55ae74fe96e0 <str> "use_timestamp=true") at src/third_party/wiredtiger/src/session/session_api.c:1890
      (gdb) p s
      $3 = (WT_SESSION_IMPL *) 0x7f4b81324f60
      

      While the same session is doing a history store walk when trying to insert a record to the history store in eviction:

      #0  0x000055ae6c8cc5a8 in __wt_lex_compare_skip (user_item=<optimized out>, tree_item=<optimized out>, matchp=<optimized out>) at src/third_party/wiredtiger/src/include/btree_cmp_inline.h:126
      #1  __wt_compare_skip (session=<optimized out>, collator=0x0, user_item=<optimized out>, tree_item=<optimized out>, cmpp=<optimized out>, matchp=<optimized out>)
          at src/third_party/wiredtiger/src/include/btree_cmp_inline.h:201
      #2  __wt_search_insert (session=0x7f4b81324f60, cbt=0x6190001abd80, ins_head=0x60e000120d20, srch_key=0x7f4b5083e4a0) at src/third_party/wiredtiger/src/btree/row_srch.c:110
      #3  0x000055ae6c7d1db9 in __cursor_skip_prev (cbt=0x6190001abd80) at src/third_party/wiredtiger/src/btree/bt_curprev.c:54
      #4  0x000055ae6c7cb76b in __cursor_row_prev (cbt=<optimized out>, newpage=<optimized out>, restart=<optimized out>, skippedp=<optimized out>) at src/third_party/wiredtiger/src/btree/bt_curprev.c:509
      #5  __wt_btcur_prev (cbt=0x6190001abd80, truncating=<optimized out>) at src/third_party/wiredtiger/src/btree/bt_curprev.c:656
      #6  0x000055ae6c98e2c8 in __curfile_prev (cursor=<optimized out>) at src/third_party/wiredtiger/src/cursor/cur_file.c:146
      #7  0x000055ae6c9b98a6 in __curhs_file_cursor_prev (session=0x7f4b81324f60, cursor=0x6190001abd80) at src/third_party/wiredtiger/src/cursor/cur_hs.c:104
      #8  0x000055ae6c9ba529 in __curhs_prev_visible (session=<optimized out>, hs_cursor=<optimized out>) at src/third_party/wiredtiger/src/cursor/cur_hs.c:361
      #9  0x000055ae6c9ae645 in __curhs_search_near (cursor=0x614001f56840, exactp=0x7f4b5083f140) at src/third_party/wiredtiger/src/cursor/cur_hs.c:683
      #10 0x000055ae6c9a5d38 in __curhs_search_near_helper (session=<optimized out>, cursor=0x614001f56840, before=false) at src/third_party/wiredtiger/src/cursor/cur_hs.c:542
      #11 0x000055ae6ca9cfa1 in __hs_insert_record (session=0x7f4b81324f60, cursor=0x614001f56840, btree=<optimized out>, key=0x604000b21290, type=3 '\003', hs_value=0x604000b20b90, tw=<optimized out>)
          at src/third_party/wiredtiger/src/history/hs_rec.c:156
      #12 0x000055ae6ca980b2 in __wt_hs_insert_updates (session=0x7f4b81324f60, page=0x6070001a8700, multi=0xfe96a107ffc) at src/third_party/wiredtiger/src/history/hs_rec.c:604
      #13 0x000055ae6cbc4f99 in __rec_hs_wrapup (session=<optimized out>, r=<optimized out>) at src/third_party/wiredtiger/src/reconcile/rec_write.c:2294
      #14 __rec_write_wrapup (session=0x7f4b81324f60, r=0x61d0000b5480, page=0x6070001a8700) at src/third_party/wiredtiger/src/reconcile/rec_write.c:2120
      #15 0x000055ae6cbb4a77 in __reconcile (session=<optimized out>, ref=<optimized out>, salvage=<optimized out>, flags=<optimized out>, page_lockedp=<optimized out>)
          at src/third_party/wiredtiger/src/reconcile/rec_write.c:219
      #16 __wt_reconcile (session=0x7f4b81324f60, ref=0x60e000180e80, salvage=0x0, flags=179) at src/third_party/wiredtiger/src/reconcile/rec_write.c:99
      #17 0x000055ae6ca776e3 in __evict_review (session=<optimized out>, ref=<optimized out>, evict_flags=<optimized out>, inmem_splitp=<optimized out>) at src/third_party/wiredtiger/src/evict/evict_page.c:699
      #18 __wt_evict (session=0x7f4b81324f60, ref=0x60e000180e80, previous_state=3 '\003', flags=4) at src/third_party/wiredtiger/src/evict/evict_page.c:159
      #19 0x000055ae6ca749e1 in __wt_page_release_evict (session=0x7f4b81324f60, ref=<optimized out>, flags=0) at src/third_party/wiredtiger/src/evict/evict_page.c:81
      #20 0x000055ae6c7dc42e in __cursor_reset (cbt=<optimized out>) at src/third_party/wiredtiger/src/include/btree_inline.h:1561
      #21 0x000055ae6c7e22ef in __wt_btcur_insert (cbt=0x61900022a180) at src/third_party/wiredtiger/src/btree/bt_cursor.c:879
      #22 0x000055ae6c9965ce in __curfile_insert (cursor=<optimized out>) at src/third_party/wiredtiger/src/cursor/cur_file.c:266
      

      I suspect it is stuck in the following while loop:

      restart:
          /*
           * If the search stack does not point at the current item, fill it in with a search.
           */
          recno = WT_INSERT_RECNO(cbt->ins);
          while ((current = cbt->ins) != PREV_INS(cbt, 0)) {
              if (CUR2BT(cbt)->type == BTREE_ROW) {
                  key.data = WT_INSERT_KEY(current);
                  key.size = WT_INSERT_KEY_SIZE(current);
                  WT_RET(__wt_search_insert(session, cbt, cbt->ins_head, &key));
              } else
                  cbt->ins = __col_insert_search(cbt->ins_head, cbt->ins_stack, cbt->next_stack, recno);
          }
      

      This starts to happen after we merge the history store restructure project (PM-1981). It is likely a fallout from it.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              chenhao.qu Chenhao Qu
              Reporter:
              chenhao.qu Chenhao Qu
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: