WiredTiger / WT-3186

Fix error path and panic detection in logging loops

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: WT2.9.2, 3.2.13, 3.4.3, 3.5.4
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Sprint: Storage 2017-03-06

      The new checkpoint test uncovered a hang that can occur in the logging code. The job is on tinderbox. The only interesting stack is thread 1:

      Thread 1 (Thread 0x7f57728fa700 (LWP 11342)):
      #0  0x0000003e6cae6e07 in sched_yield () from /lib64/libc.so.6
      #1  0x00000000004359e8 in __wt_yield () at ../src/os_posix/os_yield.c:27
      #2  0x000000000042771a in __wt_log_slot_join (session=session@entry=0x224ebb8, 
          mysize=mysize@entry=4864, flags=flags@entry=20, 
          myslot=myslot@entry=0x7ffc03db0c20) at ../src/log/log_slot.c:537
      #3  0x00000000004246ad in __log_write_internal (flags=20, lsnp=0x0, 
          record=0x2248060, session=0x224ebb8) at ../src/log/log.c:2135
      #4  __wt_log_write (session=session@entry=0x224ebb8, record=<optimized out>, 
          lsnp=lsnp@entry=0x0, flags=20) at ../src/log/log.c:2062
      #5  0x000000000047b8c9 in __wt_txn_log_commit (
          session=session@entry=0x224ebb8, cfg=cfg@entry=0x7ffc03db0cf0)
          at ../src/txn/txn_log.c:222
      #6  0x00000000004767f3 in __wt_txn_commit (session=session@entry=0x224ebb8, 
          cfg=cfg@entry=0x7ffc03db0cf0) at ../src/txn/txn.c:583
      #7  0x000000000045f901 in __session_commit_transaction (wt_session=0x224ebb8, 
          config=<optimized out>) at ../src/session/session_api.c:1427
      #8  0x0000000000405260 in subtest_populate (close_test=true, 
          opts=0x7ffc03db0e00)
          at ../../../test/csuite/wt2909_checkpoint_integrity/main.c:567
      #9  subtest_main (argc=argc@entry=9, argv=argv@entry=0x7ffc03db1850, 
          close_test=close_test@entry=true)
          at ../../../test/csuite/wt2909_checkpoint_integrity/main.c:499
      #10 0x0000000000404acf in subtest_main (close_test=true, argv=0x7ffc03db1850, 
          argc=9) at ../../../test/csuite/wt2909_checkpoint_integrity/main.c:457
      #11 main (argc=9, argv=0x7ffc03db1850)
          at ../../../test/csuite/wt2909_checkpoint_integrity/main.c:641
      
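      The stack shows the committing thread spinning in __wt_yield() inside __wt_log_slot_join(): a busy-wait that cannot terminate once the active slot has been closed with an error set, because that slot will never transition back to a joinable state. Below is a minimal sketch of the failure mode and of the kind of check the fix calls for; the names, flag bit, and state packing are illustrative assumptions, not WiredTiger's actual code.

      /*
       * Illustrative sketch only -- not WiredTiger's implementation. A
       * joiner retries while the active slot is unusable. If the slot is
       * closed with an error set and never replaced, the loop spins
       * forever; the fix is to detect the error/panic condition inside
       * the loop and return instead of yielding again.
       */
      #include <sched.h>
      #include <stdbool.h>
      #include <stdint.h>

      #define SLOT_CLOSED ((int64_t)1 << 62)  /* hypothetical flag bit */

      struct log_slot {
          volatile int64_t state; /* packed byte counts plus flag bits */
          int error;              /* nonzero once a slot write failed */
      };

      static bool panicked;       /* stands in for a connection panic flag */

      /* Join the active slot; returns 0 on success, an error otherwise. */
      static int
      slot_join(struct log_slot *slot, int64_t mysize)
      {
          int64_t old;

          for (;;) {
              old = slot->state;
              if ((old & SLOT_CLOSED) == 0) {
                  /* Try to reserve mysize bytes in the slot. */
                  if (__sync_bool_compare_and_swap(
                      &slot->state, old, old + mysize))
                      return (0);
                  continue;       /* lost the race, retry */
              }
              /*
               * The missing checks: without these, a thread yields
               * forever on a closed slot that carries an error.
               */
              if (slot->error != 0)
                  return (slot->error);
              if (panicked)
                  return (-1);    /* WT_PANIC in the real code */
              sched_yield();      /* wait for a new slot to be switched in */
          }
      }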

      Although the local slot variable had been optimized out, I managed to figure out which slot it was. Here are the slot's contents. Note that the slot is in a closed state and has an error set on it.

      (gdb) p/x log->slot_pool[12]
      $8 = {{{slot_state = 0x4000008000000080, slot_unbuffered = 0x0, 
            slot_error = 0x5, slot_start_offset = 0x28bd00, 
            slot_last_offset = 0x28bd00, slot_release_lsn = {l = {offset = 0x28bd00, 
                file = 0x1}, file_offset = 0x10028bd00}, slot_start_lsn = {l = {
                offset = 0x28bd00, file = 0x1}, file_offset = 0x10028bd00}, 
            slot_end_lsn = {l = {offset = 0x28bd80, file = 0x1}, 
              file_offset = 0x10028bd80}, slot_fh = 0x2273e30, slot_buf = {
              data = 0x7f57723a8010, size = 0x0, mem = 0x7f57723a8010, 
              memsize = 0x40000, flags = 0x0}, flags = 0x0}, __padding = {0x80, 0x0, 
            0x0, 0x0, 0x80, 0x0, 0x0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
            0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xbd, 0x28, 0x0, 0x0, 0x0, 
            0x0, 0x0, 0x0, 0xbd, 0x28, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xbd, 0x28, 
            0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0xbd, 0x28, 0x0, 0x1, 0x0, 0x0, 0x0, 0x80, 
            0xbd, 0x28, 0x0, 0x1, 0x0, 0x0, 0x0}}}
      
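      For reference, slot_state packs the slot's bookkeeping into one 64-bit word. The sketch below decodes the dumped value under the assumption (illustrative, not necessarily WiredTiger's exact layout or macro names) that the upper half carries flag bits plus the joined byte count and the lower half the released byte count; decoded that way, the slot is closed with 0x80 bytes joined and released, which matches the 0x80-byte distance between slot_start_lsn and slot_end_lsn above.

      /*
       * Illustrative decoding of the dumped slot_state; the flag bit and
       * the joined/released packing are assumptions for the example.
       */
      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>

      #define SLOT_CLOSED      ((uint64_t)1 << 62)      /* assumed flag bit */
      #define SLOT_JOINED(s)   (((s) >> 32) & 0xffffff) /* assumed joined bytes */
      #define SLOT_RELEASED(s) ((s) & 0xffffffff)       /* assumed released bytes */

      int
      main(void)
      {
          uint64_t state = 0x4000008000000080ULL; /* from the gdb dump above */

          printf("closed:   %s\n", (state & SLOT_CLOSED) != 0 ? "yes" : "no");
          printf("joined:   0x%" PRIx64 "\n", (uint64_t)SLOT_JOINED(state));
          printf("released: 0x%" PRIx64 "\n", (uint64_t)SLOT_RELEASED(state));
          return (0);
      }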

      I am also not sure how or why this closed slot is still the active slot, given that there are many free slots available.

            Assignee: Susan LoVerso (sue.loverso@mongodb.com)
            Reporter: Susan LoVerso (sue.loverso@mongodb.com)
            Votes: 0
            Watchers: 2
