Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-4366

Fix how test/format handles prepare conflict errors

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 4.1.7, WT3.2.0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • 13
    • Storage Engines 2018-10-22, Storage Engines 2018-11-05, Storage Engines 2018-11-19, Storage Engines 2018-12-03, Storage Engines 2018-12-17, Storage Engines 2018-12-31

      Edited Slack conversation:

      keith.bostic [5:25 PM]
      So, this is the CONFIG from WT-4362.

      The primary failure is fixed by your WT-3898 branch (nice fix, btw!).

      michael.cahill [5:26 PM]
      But now it's getting stuck instead?

      keith.bostic [5:26 PM]
      There's a secondary failure, though. A snapshot-isolation transaction is being repeated (to confirm the values being read are the same), and one of the reads is returning WT_PREPARE_CONFLICT. I don't think that should happen?

      michael.cahill [5:26 PM]
      Hmm, let me check that. Does the read have a timestamp?

      keith.bostic [5:28 PM]
      We've done a search, we're on an insert, and we're finding a single WT_UPDATE:

      (gdb) where
      #0  0x00007f5b44df2c73 in select () from /lib64/libc.so.6
      #1  0x000000000045c91e in __wt_sleep (seconds=100, micro_seconds=0)
          at src/os_posix/os_sleep.c:32
      #2  0x0000000000454c0d in __wt_abort (session=0x7f5b462e0f00)
          at src/os_common/os_abort.c:28
      #3  0x00000000004d3bfd in __wt_txn_read (session=0x7f5b462e0f00, 
          upd=0x7f5b101e2060, updp=0x7f5b3a7f9890) at ./src/include/txn.i:1008
      #4  0x00000000004d5a6e in __wt_cursor_valid (cbt=0x7f5b20051880, 
          updp=0x7f5b3a7f98f8, valid=0x7f5b3a7f98f7) at src/btree/bt_cursor.c:270
      #5  0x00000000004d66aa in __wt_btcur_search (cbt=0x7f5b20051880)
          at src/btree/bt_cursor.c:587
      #6  0x00000000005401e1 in __curfile_search (cursor=0x7f5b20051880)
          at src/cursor/cur_file.c:200
      #7  0x0000000000407d12 in snap_check (cursor=0x7f5b20051880, 
          start=0x7f5b3a7f9aa0, stop=0x7f5b3a7f9ae8, key=0x20e91e0, value=0x20e9210)
          at ops.c:405
      #8  0x0000000000409a1c in ops (arg=0x20e9160) at ops.c:1060
      (gdb) frame 4
      #4  0x00000000004d5a6e in __wt_cursor_valid (cbt=0x7f5b20051880, 
          updp=0x7f5b3a7f98f8, valid=0x7f5b3a7f98f7) at src/btree/bt_cursor.c:270
      270            WT_RET(__wt_txn_read(session, cbt->ins->upd, &upd));
      (gdb) p cbt.ins.upd
      $8 = (WT_UPDATE *) 0x7f5b101e2060
      (gdb) p $8.next
      $9 = (WT_UPDATE *) 0x0
      (gdb) down
      #3  0x00000000004d3bfd in __wt_txn_read (session=0x7f5b462e0f00, 
          upd=0x7f5b101e2060, updp=0x7f5b3a7f9890) at ./src/include/txn.i:1008
      1008                    WT_ASSERT(session, session->xxx == false);
      (gdb) p *upd
      $10 = {
        txnid = 17167, 
        timestamp = {
          val = 31998
        }, 
        next = 0x0, 
        size = 0, 
        type = 5 '\005', 
        prepare_state = 3 '\003', 
        data = 0x7f5b101e207e ""
      }
      (gdb) p upd_visible
      $11 = WT_VISIBLE_PREPARE
      (gdb) p session.txn
      $12 = {
        id = 0, 
        isolation = WT_ISO_SNAPSHOT, 
        forced_iso = 0, 
        snap_min = 19318, 
        snap_max = 19334, 
        snapshot = 0x7f5b200048e0, 
        snapshot_count = 2, 
        txn_logsync = 0, 
        commit_timestamp = {
          val = 35215
        }, 
        first_commit_timestamp = {
          val = 35215
        }, 
        prepare_timestamp = {
          val = 35064
        }, 
        read_timestamp = {
          val = 35216
        }, 
        commit_timestampq = {
          tqe_next = 0x7f5b462e1e90, 
          tqe_prev = 0x1efaea0
        }, 
        read_timestampq = {
          tqe_next = 0x7f5b462e03f0, 
          tqe_prev = 0x1efaed8
        }, 
        clear_commit_q = false, 
        clear_read_q = false, 
        mod = 0x7f5b2019a510, 
        mod_alloc = 2080, 
        mod_count = 0, 
        multi_update_count = 0, 
        logrec = 0x0, 
        notify = 0x0, 
        ckpt_lsn = {
          l = {
            offset = 0, 
            file = 0
          }, 
          file_offset = 0
        }, 
        ckpt_nsnapshot = 0, 
        ckpt_snapshot = 0x0, 
        full_ckpt = false, 
        rollback_reason = 0x0, 
        flags = 5160
      }
      

      michael.cahill [5:29 PM]
      Hmm, that transaction does have a read timestamp.
      So I think this behavior is expected.
      What do you think should be happening?

      keith.bostic [5:29 PM]
      I have no idea.
      If this is OK, I'm happy to push that fix.

      michael.cahill [5:29 PM]
      The update is prepared (i.e., we don't know what its commit timestamp will be).
      So we can't tell whether it will be visible to that reader or not.
      That's what it means when we return WT_PREPARE_CONFLICT.

      keith.bostic [5:30 PM]
      We're in snap-check — in other words, we're repeating a read that worked for us before.

      michael.cahill [5:30 PM]
      Sure, this return means "keep trying until the prepared transaction is resolved".
      Was this value read before, or is there a scan?
      i.e., this has been inserted into the middle of a range that was previously scanned?

      keith.bostic [5:31 PM]
      Read before — we don't repeat scans (which we should, but that's another question)

      michael.cahill [5:31 PM]
      OK, that seems a little strange then. Let me stare more at the timestamps...

      keith.bostic [5:31 PM]
      So, we did a READ, decided to resolve the txn, repeated the read, and it failed.
      I've got it in a debugger...
      This one doesn't trivially reproduce, I've seen it go hours w/o a hit (and then 5 hits in 5 minutes, so...)

      michael.cahill [5:34 PM]
      So prepare_state = 3 means the prepare has been resolved — I would expect it to be visible. Checking the code...
      From looking at __wt_txn_upd_visible_type, I think the read must have raced with the transaction committing.
      But then I don't understand how test/format could have read the value previously.

      keith.bostic [5:37 PM]
      The tracked operation was either a search or search-near, but that's all I know about it.

      michael.cahill [5:37 PM]
      I'm looking at snap_check in format's ops.c — it looks at all keys in a range, right?

      keith.bostic [5:37 PM]
      We're in a transaction.
      No...
      It's either line 831 or 913
      Which tracked the operation.
      op == READ
      keyno == 773914
      Matches the insert item:

      (gdb) p *cbt.ins
      $15 = {
        upd = 0x7f5b101e2060, 
        u = {
          recno = 773914, 
          key = {
            offset = 773914, 
            size = 0
          }
        }, 
        next = 0x7f5b101e2250
      }
      

      michael.cahill [5:39 PM]
      Got it, start / stop are pointers into an array, not keys.

      keith.bostic [5:39 PM]
      Yes.

      (gdb) p *start
      $13 = {
        op = READ, 
        keyno = 773914, 
        last = 0, 
        kdata = 0x0, 
        ksize = 0, 
        kmemsize = 0, 
        vdata = 0x7f5b20129a70, 
        vsize = 22, 
        vmemsize = 369
      }
      

      The WT_UPDATE is a tombstone, is that OK?
      I can't get my hands on the page itself w/o a ton of work — easier to recompile & attach at next failure.

      michael.cahill [5:43 PM]
      So it looks to me as if we're being over-eager with this error: we should be able to figure out that this read is in the past of the prepare timestamp and skip over the tombstone immediately.
      That said, it's an optimization: the expectation is that the application will keep retrying when it gets WT_PREPARE_CONFLICT until the prepared transaction commits or aborts.

      keith.bostic [5:44 PM]
      I can loop the error for now?

      michael.cahill [5:45 PM]
      Not just for now: format can't treat it as an error because it means "no answer yet, try again".

      keith.bostic [5:45 PM]
      Should we be doing that in general? All returns of WT_PREPARE_CONFLICT should loop?

      michael.cahill [5:45 PM]
      Yes.

      keith.bostic [5:46 PM]
      (And, of course, is it reasonable to do it in the library a la WT_ROLLBACK?)

      michael.cahill [5:46 PM]
      No.

      keith.bostic [5:46 PM]
      Thought you'd say that.

      michael.cahill [5:46 PM]
      Actually, we could think about doing something, but MongoDB doesn't want threads stuck inside WT.

      keith.bostic [5:46 PM]
      Format currently treats it as WT_ROLLBACK (which works except for here)

      michael.cahill [5:47 PM]
      For update operations, rolling back is usually correct — the update will only succeed if the prepared transaction rolls back.
      But for reads, it means "no answer yet".

      keith.bostic [5:48 PM]
      But reads should loop until they get a different result?

      michael.cahill [5:48 PM]
      Yes.

      keith.bostic [5:48 PM]
      OK. I'll open a new ticket and do that work in format.
      Thanks! God knows when I'd have figured that out...

      michael.cahill [5:48 PM]
      Thanks.
      I think it would also be worth reaching out to Alex and Vamsi: in this case, I think WT should be able to skip that update immediately.

      keith.bostic [5:49 PM]
      OK — I'll do the write up & ping them.
      Alex can schedule the work if & when.
      To be clear, the correct information is going to be on the page, right?

      michael.cahill [5:50 PM]
      Thanks: unless I'm misunderstanding, we probably want a separate ticket for that (it will just make this case less likely for format, but still need a retry).

      keith.bostic [5:50 PM]
      OK, 2 tickets.

      michael.cahill [5:50 PM]
      Yes, in this case we need to wait to figure out that the tombstone is in the future of the read, so the on-page value is correct.

      keith.bostic [5:50 PM]
      All good, I'll write that up.

      michael.cahill [5:50 PM]
      Thanks!

      michael.cahill [5:58 PM]
      Hmm, looking some more at __wt_txn_upd_visible_type, it's doing what I expect. And the update's timestamp is definitely in the past.

      Given that, I think the read must have raced with the commit of the transaction that prepared this update.
      But then retrying it would return WT_NOTFOUND, which is not what test/format is expecting.
      So I still think format needs to retry, but maybe Alex / Vamsi need to double check the logic here to figure out how format could have read the value successfully.

        1. CONFIG_4366
          2 kB
        2. ops_4366.c
          55 kB

            Assignee:
            vamsi.krishna@mongodb.com Vamsi Boyapati
            Reporter:
            keith.bostic@mongodb.com Keith Bostic (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: