WiredTiger / WT-6676

Fast truncate timestamp gets lost after restart

      History store verification is still failing in the rts-test-format branch. The symptom is that after restarting from a backup, we find a history store record with a non-globally visible tombstone but we can't find the corresponding key in the data store.

      Using wt dump, I found that the key missing from the data store is actually present in the backup file after restart. However, every key on the page that contains the missing key has a globally visible tombstone (txnid 0, timestamp 0), appended by __wt_delete_page_instantiate when a fast-truncated page is read back into memory.

      0x1655160: row-store leaf
              0x1655160, memory, leaf, newest durable: (0, 187400)/(0, 0) oldest start: (0, 0)/0 newest stop (4294967295, 4294967295)/18446744073709551605, [102566400-102566912, 512, 1114934664]
              disk 0x1413990, dsk_mem_size 584, entries 9, clean, disk-alloc, page-state=0, memory-size 1664
              K {0000323521.00/opqrstuvwxyzabcdefg}
              value/short: len 28
              V {0000323521/LMNOPQRSTUVWXYZAB}
              value {tombstone}
              txn id 0, start_ts (0, 0)
              K {0000323522.00/opqrstuvwxyzabcd}
              value/short: len 26
              V {0000323522/LMNOPQRSTUVWXYZ}
              value {tombstone}
              txn id 0, start_ts (0, 0)
              K {0000323523.00/opqrstuvwxyzabcdefghijklmnopqrstuvwxyzab}
              value/short: len 9
              V {000032352}
              value {tombstone}
              txn id 0, start_ts (0, 0)
              K {0000323524.00/opqrstuvwxyzabcdef}
              value/short: len 18
              V {0000323524/LMNOPQR}
              value {tombstone}
              txn id 0, start_ts (0, 0)
              K {0000323525.00/opqrstuvwxyzabcdefghijklmnopqrstuvw}
              value/short: len 17
              V {0000323525/LMNOPQ}
              value {tombstone}
              txn id 0, start_ts (0, 0)
      

      The key is then removed when the page is evicted again, because its tombstone is globally visible, so history store verification can no longer find the key in the data store.
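To make the eviction behaviour concrete, here is a minimal sketch of a global-visibility check. It is a toy model, not WiredTiger code: the `toy_*` types and functions, and the rule that a tombstone is globally visible when its transaction ID is below the oldest ID and its timestamp is at or below the pinned timestamp, are simplifying assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; these are NOT WiredTiger structures. */
typedef struct {
    uint64_t txnid;    /* transaction that wrote the tombstone */
    uint64_t start_ts; /* timestamp at which the delete takes effect */
} toy_tombstone;

typedef struct {
    uint64_t oldest_txnid; /* assumed: every ID below this is visible to all readers */
    uint64_t pinned_ts;    /* assumed: oldest timestamp any reader may still use */
} toy_global_state;

/* A tombstone is treated as globally visible when no reader can still see the old value. */
static bool
toy_globally_visible(const toy_global_state *g, const toy_tombstone *t)
{
    return (t->txnid < g->oldest_txnid && t->start_ts <= g->pinned_ts);
}

/* In this model, eviction may drop a key whose newest update is a globally visible tombstone. */
static bool
toy_evict_drops_key(const toy_global_state *g, const toy_tombstone *newest)
{
    return (toy_globally_visible(g, newest));
}
```

In this model a tombstone with txnid 0 and timestamp 0 passes the check for any global state with a nonzero oldest ID, so the key is dropped at eviction. A tombstone that kept a truncate timestamp newer than the pinned timestamp would fail the check and the key would be retained.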

      /*
       * __rec_child_deleted --
       *     Handle pages with leaf pages in the WT_REF_DELETED state.
       */
      static int
      __rec_child_deleted(WT_SESSION_IMPL *session, WT_RECONCILE *r, WT_REF *ref, WT_CHILD_STATE *statep)
      {
          WT_PAGE_DELETED *page_del;
      
          page_del = ref->page_del;
      
          /*
           * Internal pages with child leaf pages in the WT_REF_DELETED state are a special case during
           * reconciliation. First, if the deletion was a result of a session truncate call, the deletion
           * may not be visible to us. In that case, we proceed as with any change not visible during
           * reconciliation by ignoring the change for the purposes of writing the internal page.
           *
           * In this case, there must be an associated page-deleted structure, and it holds the
           * transaction ID we care about.
           *
           * In some cases, there had better not be any updates we can't see.
           *
           * A visible update to be in READY state (i.e. not in LOCKED or PREPARED state), for truly
           * visible to others.
           */
          if (F_ISSET(r, WT_REC_CLEAN_AFTER_REC | WT_REC_VISIBILITY_ERR) && page_del != NULL &&
            __wt_page_del_active(session, ref, false)) {
              if (F_ISSET(r, WT_REC_VISIBILITY_ERR))
                  WT_RET_PANIC(session, EINVAL, "reconciliation illegally skipped an update");
              return (__wt_set_return(session, EBUSY));
          }
      
          /*
           * Deal with any underlying disk blocks.
           *
           * First, check to see if there is an address associated with this leaf: if there isn't, we're
           * done, the underlying page is already gone. If the page still exists, check for any
           * transactions in the system that might want to see the page's state before it's deleted.
           *
           * If any such transactions exist, we cannot discard the underlying leaf page to the block
           * manager because the transaction may eventually read it. However, this write might be part of
           * a checkpoint, and should we recover to that checkpoint, we'll need to delete the leaf page,
           * else we'd leak it. The solution is to write a proxy cell on the internal page ensuring the
           * leaf page is eventually discarded.
           *
           * If no such transactions exist, we can discard the leaf page to the block manager and no cell
           * needs to be written at all. We do this outside of the underlying tracking routines because
           * this action is permanent and irrevocable. (Clearing the address means we've lost track of the
           * disk address in a permanent way. This is safe because there's no path to reading the leaf
           * page again: if there's ever a read into this part of the name space again, the cache read
           * function instantiates an entirely new page.)
           */
          if (ref->addr != NULL && !__wt_page_del_active(session, ref, true)) {
              /*
               * Minor memory cleanup: if a truncate call deleted this page and we were ever forced to
               * instantiate the page in memory, we would have built a list of updates in the page
               * reference in order to be able to commit/rollback the truncate. We just passed a
               * visibility test, discard the update list.
               */
              if (page_del != NULL) {
                  __wt_free(session, ref->page_del->update_list);
                  __wt_free(session, ref->page_del);
              }
      
              WT_RET(__wt_ref_block_free(session, ref));
          }
      
          /*
           * If the original page is gone, we can skip the slot on the internal page.
           */
          if (ref->addr == NULL) {
              *statep = WT_CHILD_IGNORE;
              return (0);
          }
      
          /*
           * Internal pages with deletes that aren't stable cannot be evicted, we don't have sufficient
           * information to restore the page's information if subsequently read (we wouldn't know which
           * transactions should see the original page and which should see the deleted page).
           */
          if (F_ISSET(r, WT_REC_EVICT))
              return (__wt_set_return(session, EBUSY));
      
          /* If the page cannot be marked clean. */
          r->leave_dirty = true;
      
          /*
           * If the original page cannot be freed, we need to keep a slot on the page to reference it from
           * the parent page.
           *
           * If the delete is not visible in this checkpoint, write the original address normally.
           * Otherwise, we have to write a proxy record. If the delete state is not ready, then delete is
           * not visible as it is in prepared state.
           */
          if (!__wt_page_del_active(session, ref, false))
              *statep = WT_CHILD_PROXY;
      
          return (0);
      }
      

      Reading a fast-truncated (deleted) page back into memory after a restart loses the original page_del information, because that information is held only in memory and does not survive the restart. Before durable history this was fine, because we only cared about data as of the stable timestamp snapshot. If the truncate is behind the stable timestamp, it is OK to append a globally visible tombstone to each key, since we never try to read the truncated historical data. If the truncate is after the stable timestamp, we don't include it in the checkpoint, so that case also works.
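The argument above can be illustrated with a toy timestamp-visibility rule. This is only a sketch under stated assumptions: `toy_key_visible_at` is a hypothetical helper, not a WiredTiger function, and it models a reader as seeing the key unless a tombstone at or before its read timestamp hides it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy rule (an assumption for illustration): a reader at read_ts still sees
 * the key only if the tombstone takes effect after read_ts.
 */
static bool
toy_key_visible_at(uint64_t tombstone_ts, uint64_t read_ts)
{
    return (tombstone_ts > read_ts);
}
```

With a truncate at timestamp 150 and a stable timestamp of 200, a read at the stable timestamp gives the same answer whether the tombstone carries timestamp 150 or timestamp 0: the key is hidden either way. A historical read at timestamp 100 gives different answers: the key is visible with the real timestamp but hidden with the zero timestamp, which is the case durable history exposes.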

      However, with durable history, losing the timestamp information of a fast truncate wrongly removes the key from the data store after restart, as in the case above. In addition, if a checkpoint writes a fast truncate that is after the stable timestamp, rollback to stable currently cannot roll it back.
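The rollback-to-stable half of the problem can be pictured with a similar toy model. Everything here is hypothetical: `toy_truncate_record` and `toy_rts_must_undo` are illustrative names, and the sketch assumes rollback to stable decides by comparing a recorded commit timestamp with the stable timestamp. It is not a description of how WiredTiger stores truncates or of any proposed fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of a fast truncate as seen after restart. */
typedef struct {
    bool has_commit_ts; /* false when the timestamp did not survive the restart */
    uint64_t commit_ts; /* meaningful only when has_commit_ts is true */
} toy_truncate_record;

/*
 * Toy decision: a truncate must be undone when it committed after the stable
 * timestamp. Without a recorded timestamp there is nothing to compare, so the
 * truncate is indistinguishable from a stable one.
 */
static bool
toy_rts_must_undo(const toy_truncate_record *rec, uint64_t stable_ts)
{
    if (!rec->has_commit_ts)
        return (false);
    return (rec->commit_ts > stable_ts);
}
```

In this model a truncate committed at timestamp 300 with a stable timestamp of 200 is undone only if its timestamp was kept. Once the timestamp is lost, the same truncate is silently treated as stable.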

      I believe this problem does not currently affect MongoDB, as MongoDB only uses truncate for the oplog and, at the moment, does not care about historical data before the stable timestamp after restart. (Correct me if I'm wrong.)

            Assignee:
            keith.bostic@mongodb.com Keith Bostic (Inactive)
            Reporter:
            chenhao.qu@mongodb.com Chenhao Qu