WiredTiger / WT-559

Create a more general-purpose block-manager preload call

    • Type: Task
    • Resolution: Done
    • WT1.6.1
    • Affects Version/s: None
    • Component/s: None

      Create a more general-purpose block-manager preload call and pre-load individual pages instead of using the page locations to guess at the right file chunks.

      @agorrod @michaelcahill

      Alex, Michael: I took another look at the cache warming code today.

      There are two problems with it, I think. First, it's going to behave badly if a file isn't bulk-loaded (potentially attempting to load really big chunks of the file into the cache). Second, putting a flag into the WT_SESSION_IMPL and then noticing that flag in the middle of the block manager read function is a pretty evil violation of layering.

      I can't think of any way to really figure out what's going on underneath: the place where we know the tree is bulk-loaded and which blocks are interesting for a pre-load is far away from any place where we could save that information and get it back when we open the tree again.

      So, this new branch is an alternate approach:

      I added code to walk the checkpoint's root page and call a new block-manager function to pre-load the second level of the tree. We could walk further down the tree than just the second level, of course, but loading the second level is already going to pre-load a lot of pages: even for a small block size like 4KB, we'll pre-load somewhere between 400 and 500 pages (a block address on a btree page starts out at 7B).
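
      The 400-to-500 figure can be sanity-checked with rough arithmetic. The sketch below is an illustration only: `estimate_children` and its `per_entry_overhead` parameter are hypothetical names, and the 2 to 3 byte overhead used in the note afterwards is an assumed allowance for the key and cell headers stored next to each address, not a value taken from the WiredTiger source.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rough estimate of how many child addresses fit on one internal page.
 *
 * per_entry_overhead is a hypothetical allowance for whatever is stored
 * alongside each address (key bytes, cell headers); it is an assumption
 * for illustration, not a WiredTiger constant. Page headers are ignored.
 */
size_t
estimate_children(size_t page_size, size_t addr_size, size_t per_entry_overhead)
{
	size_t entry_size = addr_size + per_entry_overhead;

	/* Guard against a zero-sized entry to avoid dividing by zero. */
	return (entry_size == 0 ? 0 : page_size / entry_size);
}
```

      With a 4096-byte page and a 7-byte address, an assumed overhead of 2 or 3 bytes per entry gives 455 or 409 entries, which is in line with the range quoted above; with no overhead at all the upper bound is 585.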

      I've also changed it so we pre-load all trees, not just mapped checkpoints, using posix_fadvise if it's available, and otherwise doing the read.

              /* Check for a mapped block. */
              mapped = bm->map != NULL && offset + size <= (off_t)bm->maplen;
              if (mapped)
                      /* Mapped file: pre-load straight from the mapping. */
                      WT_RET(__wt_mmap_preload(
                          session, (uint8_t *)bm->map + offset, size));
              else {
      #ifdef HAVE_POSIX_FADVISE
                      /* Ask the kernel to start reading the block. */
                      if ((ret = posix_fadvise(block->fh->fd,
                          (off_t)offset, (off_t)size, POSIX_FADV_WILLNEED)) != 0)
                              WT_RET_MSG(
                                  session, ret, "%s: posix_fadvise", block->name);
      #else
                      /* No posix_fadvise: read the block into a scratch buffer. */
                      WT_DECL_ITEM(tmp);
                      WT_RET(__wt_scr_alloc(session, size, &tmp));
                      ret = __wt_block_read_off(
                          session, block, tmp, offset, size, cksum);
                      __wt_scr_free(&tmp);
                      WT_RET(ret);
      #endif
              }
      

      I should point out that this change will pre-load leaf pages if the tree is small.

      This change means we're doing individual system calls to pre-load each page, instead of one big call for a range of pages. On the other hand, we're not actually reading any pages into memory to figure out the shape of the tree.
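
      To make the one-call-per-page trade-off concrete, here is a minimal standalone sketch of the pattern, not the code in the branch. `struct page_addr`, `preload_page` and `preload_pages` are hypothetical names, and the sketch detects `posix_fadvise` by testing the `POSIX_FADV_WILLNEED` macro rather than a configure-time `HAVE_POSIX_FADVISE`, which is an assumption made only to keep it self-contained.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a block address: a file offset and a length. */
struct page_addr {
	off_t offset;
	size_t size;
};

/*
 * Pre-load one page. Returns 0 on success or an errno-style value.
 * Uses posix_fadvise when the platform provides it, otherwise reads
 * the bytes into a throwaway buffer.
 */
int
preload_page(int fd, const struct page_addr *addr)
{
#ifdef POSIX_FADV_WILLNEED
	/* posix_fadvise returns the error number directly, not via errno. */
	return (posix_fadvise(
	    fd, addr->offset, (off_t)addr->size, POSIX_FADV_WILLNEED));
#else
	char *buf;
	ssize_t n;
	size_t done = 0;
	int ret = 0;

	if (addr->size == 0)
		return (0);
	if ((buf = malloc(addr->size)) == NULL)
		return (ENOMEM);
	while (done < addr->size) {
		n = pread(fd, buf + done,
		    addr->size - done, addr->offset + (off_t)done);
		if (n < 0) {
			ret = errno;
			break;
		}
		if (n == 0)		/* End of file: nothing more to read. */
			break;
		done += (size_t)n;
	}
	free(buf);
	return (ret);
#endif
}

/* Pre-load a list of pages, one system call (or read loop) per page. */
int
preload_pages(int fd, const struct page_addr *addrs, size_t n)
{
	size_t i;
	int ret;

	for (i = 0; i < n; ++i)
		if ((ret = preload_page(fd, &addrs[i])) != 0)
			return (ret);
	return (0);
}
```

      A caller would fill an array of `struct page_addr` from the addresses found on the root page and pass it to `preload_pages`. The cost is one call per page; the benefit is that nothing has to be read into the cache just to decide what to pre-load.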

      I don't know what test the original change helped, or how much it helped, but I'd be a lot more comfortable with this approach instead of what we have in the tree at the moment.

            Assignee:
            alexander.gorrod@mongodb.com Alexander Gorrod
            Reporter:
            keith.bostic@mongodb.com Keith Bostic (Inactive)
