WiredTiger / WT-313

Race when checkpointing and using bulk cursors

    • Type: Task
    • Resolution: Done
    • WT1.3
    • Affects Version/s: None
    • Component/s: None

      Michael and I ran into an issue today when running test/format on the LSM code.

      It turns out that there is an issue when doing a checkpoint while closing a bulk cursor. The issue isn't related to LSM.

      I've made some changes to the fops test application that demonstrate the problem, and pushed them to a new branch, fops-bulk (https://github.com/wiredtiger/wiredtiger/tree/fops-bulk).

      If I run fops with:
      ./t -n 1000 -r 1 -t 2

      It regularly hangs. When I capture the state in a debugger, I can see:

      Thread 4 (process 6614):
      #0  0x00007fff8df83122 in __psynch_mutexwait ()
      #1  0x00007fff8f23cddd in pthread_mutex_lock ()
      #2  0x000000010005595c in __wt_spin_lock (session=0x100804c30, t=0x1008044f0) at mutex.i:81
      #3  0x0000000100055852 in __curbulk_close (cursor=0x101800500) at cur_bulk.c:53
      #4  0x00000001000013ac in obj_bulk () at file.c:31
      #5  0x0000000100001ca3 in fop (arg=0x1) at fops.c:134
      #6  0x00007fff8f237782 in _pthread_start ()
      #7  0x00007fff8f2241c1 in thread_start ()
      
      Thread 3 (process 6614):
      #0  0x00007fff8df8315e in __psynch_rw_rdlock ()
      #1  0x00007fff8f23d915 in pthread_rwlock_rdlock ()
      #2  0x0000000100067a43 in __wt_readlock (session=0x100804e48, rwlock=0x100600500) at os_mtx.c:176
      #3  0x0000000100051a41 in __conn_btree_open_lock (session=0x100804e48, flags=0) at conn_btree.c:36
      #4  0x0000000100051c8d in __conn_btree_get (session=0x100804e48, name=0x1018002f0 "file:__wt", ckpt=0x0, flags=0) at conn_btree.c:106
      #5  0x000000010005249d in __wt_conn_btree_get (session=0x100804e48, name=0x1018002f0 "file:__wt", ckpt=0x0, cfg=0x0, flags=0) at conn_btree.c:254
      #6  0x000000010007e507 in __wt_session_get_btree (session=0x100804e48, uri=0x1018002f0 "file:__wt", checkpoint=0x0, cfg=0x0, flags=0) at session_btree.c:244
      #7  0x00000001000624c6 in __wt_meta_btree_apply (session=0x100804e48, func=0x100086c30 <__wt_checkpoint>, cfg=0x100480e48, flags=0) at meta_apply.c:37
      #8  0x0000000100086673 in __wt_txn_checkpoint (session=0x100804e48, cfg=0x100480e48) at txn_ckpt.c:100
      #9  0x000000010007d76b in __session_checkpoint (wt_session=0x100804e48, config=0x100087c12 "name=fops") at session_api.c:509
      #10 0x000000010000169e in obj_checkpoint () at file.c:84
      #11 0x0000000100001c5d in fop (arg=0x0) at fops.c:122
      #12 0x00007fff8f237782 in _pthread_start ()
      #13 0x00007fff8f2241c1 in thread_start ()
      

      The bulk close is attempting to get the schema lock while holding the handle lock. The checkpoint is attempting to get the handle lock while holding the schema lock.
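
      For illustration, here is a minimal standalone sketch of one common remedy for this kind of lock-order inversion: make every code path take the two locks in the same order. It is not WiredTiger code. The names schema_lock, handle_lock, ordered_op and run_ordered_demo are hypothetical, and both locks are modeled as plain pthread mutexes, whereas the traces show a spin lock and a read/write lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two locks seen in the traces. */
static pthread_mutex_t schema_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t handle_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_ops; /* protected by handle_lock */

/*
 * Every caller takes schema_lock first and handle_lock second, so no
 * thread can hold one lock while waiting for a thread that holds the
 * other in the opposite order.
 */
static void ordered_op(void)
{
    pthread_mutex_lock(&schema_lock);
    pthread_mutex_lock(&handle_lock);
    shared_ops++;
    pthread_mutex_unlock(&handle_lock);
    pthread_mutex_unlock(&schema_lock);
}

/* Thread body: perform the requested number of ordered operations. */
static void *worker(void *arg)
{
    int n = *(int *)arg;

    for (int i = 0; i < n; i++)
        ordered_op();
    return NULL;
}

/*
 * Run two threads (think "bulk close" and "checkpoint") that both follow
 * the same lock order, and return the total number of operations done.
 */
long run_ordered_demo(int iterations)
{
    pthread_t bulk, ckpt;
    int rc;

    shared_ops = 0;
    rc = pthread_create(&bulk, NULL, worker, &iterations);
    assert(rc == 0);
    rc = pthread_create(&ckpt, NULL, worker, &iterations);
    assert(rc == 0);
    (void)rc;
    pthread_join(bulk, NULL);
    pthread_join(ckpt, NULL);
    return shared_ops;
}
```

      Built with -pthread, a call such as run_ordered_demo(1000) should complete and return 2000. If one of the two paths took handle_lock before schema_lock instead, the same program could hang in the way the traces show.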

      I'm wondering if checkpoint should skip files that are being used for bulk load. Do you think that is a reasonable approach? I guess it would skip creating an empty file in a checkpoint if the open happened before a checkpoint started and the bulk cursor was opened after.
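
      To make the proposal concrete, here is a rough sketch of a checkpoint pass that skips handles marked as being bulk loaded. Everything in it is an assumption for discussion: struct file_handle, its bulk_load flag and checkpoint_all are hypothetical and do not correspond to WiredTiger's real structures. The sketch is also single-threaded, so it leaves open how the flag would be set and read safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handle; not WiredTiger's real data-handle structure. */
struct file_handle {
    const char *name;
    bool bulk_load;  /* assumed to be set while a bulk cursor is open */
    int checkpoints; /* number of checkpoints taken on this handle */
};

/*
 * Checkpoint every handle that is not being bulk loaded.
 * Returns the number of handles that were skipped.
 */
size_t checkpoint_all(struct file_handle *handles, size_t n)
{
    size_t skipped = 0;

    assert(handles != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (handles[i].bulk_load) {
            /* Skip: the bulk cursor owns this file for now. */
            skipped++;
            continue;
        }
        handles[i].checkpoints++; /* stand-in for the real checkpoint work */
    }
    return skipped;
}
```

      With three handles where only the second has bulk_load set, checkpoint_all returns 1 and leaves that handle's checkpoint count at 0.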

            Assignee: Keith Bostic (keith.bostic@mongodb.com)
            Reporter: Alexander Gorrod (alexander.gorrod@mongodb.com)