Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-760

Hang in log consolidation code

    XMLWordPrintable

    Details

    • Type: Task
    • Status: Closed
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      When I run against LinkBench, I occasionally see a hang:

           34 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__wt_log_slot_wait,__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
           15 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__log_release,__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
            1 slot=0x27dd890),__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
      

      There are 50 application threads. 34 of those threads are waiting for their slots to be finalized. 15 of those threads are waiting for their chance to release their slot. 1 of those threads is attempting to find a new slot in the slot pool.

      If I capture the hang in GDB I can see:

      Thread 67 (Thread 0x7ff98827d700 (LWP 12971)):
      #0  0x00007ff9aa10c7cc in __wt_log_slot_close (
          session=session@entry=0x1bf4118, slot=0x1c06670)
          at ../src/log/log_slot.c:197
      WT-1  0x00007ff9aa10b440 in __wt_log_write (session=session@entry=0x1bf4118, 
          record=0x7ff57c011940, lsnp=lsnp@entry=0x0, flags=2)
          at ../src/log/log.c:1018
      WT-2  0x00007ff9aa13ad0d in __wt_txn_log_commit (
          session=session@entry=0x1bf4118, cfg=cfg@entry=0x7ff98827abe0)
          at ../src/txn/txn_log.c:129
      WT-3  0x00007ff9aa1391a6 in __wt_txn_commit (session=session@entry=0x1bf4118, 
          cfg=cfg@entry=0x7ff98827abe0) at ../src/txn/txn.c:352
      WT-4  0x00007ff9aa1317d5 in __session_commit_transaction (wt_session=0x1bf4118, 
          config=<optimised out>) at ../src/session/session_api.c:649
      (More stack frames follow...)
       
      (gdb) thread 67
      [Switching to thread 67 (Thread 0x7ff98827d700 (LWP 12971))]
      #0  0x00007ff9aa10c7cc in __wt_log_slot_close (
          session=session@entry=0x1bf4118, slot=0x1c06670)
          at ../src/log/log_slot.c:197
      197		if (++log->pool_index >= SLOT_POOL)
      (gdb) p slot->slot_release_lsn 
      $1 = {file = 155, offset = 42647808}
      (gdb) p log->write_lsn
      $2 = {file = 155, offset = 42648192}
      (gdb) p *slot
      $3 = {slot_state = 25347, slot_group_size = 384, slot_error = 0, 
        slot_index = 0, slot_start_offset = 42647808, slot_release_lsn = {
          file = 155, offset = 42647808}, slot_start_lsn = {file = 155, 
          offset = 42647808}, slot_end_lsn = {file = 155, offset = 42648192}, 
        slot_fh = 0x7ff4d1e3b680, slot_buf = {data = 0x7ff89bddf010, size = 0, 
          flags = 0, mem = 0x7ff89bddf010, memsize = 40706048}, 
        slot_done_cond = 0x1c06d20, flags = 2}
      (gdb) p slot->slot_end_lsn 
      $4 = {file = 155, offset = 42648192}
      

      So.. The thread in

       has a 

      slot_release_lsn

       that is one behind the current 

      log->write_lsn

      . That should not happen.
       
      I think there is a real issue in this code, if there are two or more active slots, then a slot may own the oldest slot_release_lsn, and get stuck waiting for an available slot in 

      __wt_slot_close{{`}} which would lead to a hang.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                alexander.gorrod Alexander Gorrod
              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: