WiredTiger / WT-7531

Treat update restore eviction as progress



    • Type: Build Failure
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Fix Version/s: WT10.0.1, 4.4.7, 5.0.0-rc1, 5.1.0-rc0
    • Sprint: Storage - Ra 2021-05-31, Storage - Ra 2021-06-14


      unit-test-with-compile failed on OS X 10.14

      Host: macos-1014-59.macstadium.build.10gen.cc
      Project: WiredTiger (develop)
      Commit: diff: WT-7234 prefix-compressed keys and memory amplification (#6483)

      • When a row-store page is read into memory, we find the largest group of prefix-compressed keys we
        can build from a previous key on the page plus the per-key suffix bytes, and save the start/stop
        information for the group in the WT_PAGE structure. It doesn't cost anything when the page is
        read into memory (it's just updating a few local variables in the main loop). It costs 8B in
        the WT_PAGE structure (80B to 88B), plus moving the page's read generation out of the "hot"
        chunk of the WT_PAGE structure.
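      That page-read scan might be sketched as follows. This is a simplification under my own
      assumptions (all names are hypothetical, and a nonzero prefix is treated as continuing the
      current run); the real code updates a few locals in the main page-read loop and records the
      best run's start/stop in WT_PAGE:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Track the current run of prefix-compressed keys that can be built from
       * one fully-instantiated starting key, and remember the largest run.
       * Illustrative only; not WiredTiger's actual page-read code.
       */
      struct prefix_group {
          uint32_t start, stop; /* slots [start, stop] of the best group */
      };

      static void
      find_largest_prefix_group(
          const uint32_t *prefix_lens, uint32_t nkeys, struct prefix_group *best)
      {
          uint32_t slot, run_start, best_len;

          best->start = best->stop = 0;
          best_len = 0;
          run_start = 0;
          for (slot = 1; slot < nkeys; ++slot) {
              /* A zero prefix means a fully-instantiated key: a new run starts here. */
              if (prefix_lens[slot] == 0)
                  run_start = slot;
              else if (slot - run_start > best_len) {
                  best_len = slot - run_start;
                  best->start = run_start;
                  best->stop = slot;
              }
          }
      }
      ```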

      When we look up a key on a leaf page, if the key is part of the page's group of prefix-compressed
      keys, we unpack the key's on-page cell. If the key's prefix is entirely found in the group's
      starting key, we copy the key's prefix and suffix into the caller's buffer and don't instantiate
      anything in memory. The cost here is (1) unpacking an on-page cell and (2) doing two memory
      copies into the caller's buffer every time we search for a key in the prefix group. Previously,
      we built the key once and instantiated it in allocated memory on the first search, and subsequent
      searches would point the caller's buffer at that instantiated memory instead of doing any cell
      unpacking or memory copies. Obviously, this is a real cost: on second and subsequent searches
      we're trading work to avoid the memory amplification. It wouldn't be difficult to detect a page
      being repeatedly searched if we wanted to, and to instantiate these keys in that case, which
      would focus this change on point reads/updates vs. hot-cache workloads.
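      The two-copy fast path can be sketched like this (illustrative names, not WiredTiger's API):
      when the key's prefix lies entirely within the group's starting key, the caller's buffer is
      filled by two memory copies and nothing is instantiated in memory.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <string.h>

      /* A stand-in for the caller's buffer; hypothetical, not WT_ITEM. */
      struct key_buf {
          uint8_t data[256];
          size_t size;
      };

      /*
       * build_key_from_group --
       *     If the key's prefix fits inside the group's starting key, copy
       *     prefix + suffix into the caller's buffer and return 0; otherwise
       *     return -1 and the caller falls back to the slow path.
       */
      static int
      build_key_from_group(const uint8_t *group_start_key, size_t group_start_size,
          size_t prefix_len, const uint8_t *suffix, size_t suffix_len, struct key_buf *out)
      {
          if (prefix_len > group_start_size)
              return (-1); /* prefix not fully available in the starting key */

          memcpy(out->data, group_start_key, prefix_len);     /* copy 1: prefix */
          memcpy(out->data + prefix_len, suffix, suffix_len); /* copy 2: suffix */
          out->size = prefix_len + suffix_len;
          return (0);
      }
      ```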

      When we reconcile a page, we don't increase the key prefix compression if the previous key was also
      prefix compressed and we don't gain at least 10 bytes over the previous key's compression. In
      other words, if we had a key prefix, and the next key's prefix isn't at least 10B longer,
      we use the same prefix for both keys. In my smoke tests, common strings plus appended record
      numbers generate short groups of keys with a common prefix without this change, and with this
      change they generate long groups of keys with a common prefix. Obviously, this is a real cost
      as well, although the block compression of the page should pick up the common strings we are
      choosing not to remove as part of prefix-compression.
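      A sketch of that 10-byte rule, under my assumption that it reduces to comparing the achievable
      match length against the previous prefix plus a threshold (names and structure are hypothetical,
      not the reconciliation code's):

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Only take a longer prefix when it gains at least this many bytes. */
      #define PREFIX_GAIN_THRESHOLD 10

      /*
       * choose_prefix --
       *     match_len is how many leading bytes this key shares with the
       *     previous key (the maximum prefix we could take). If the previous
       *     key was prefix-compressed and the gain is under the threshold,
       *     reuse the previous prefix so keys form long common-prefix runs.
       */
      static size_t
      choose_prefix(size_t prev_prefix, size_t match_len, int prev_was_compressed)
      {
          if (prev_was_compressed && match_len < prev_prefix + PREFIX_GAIN_THRESHOLD)
              return (match_len < prev_prefix ? match_len : prev_prefix);
          return (match_len);
      }
      ```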

      • Compiler complaints about uninitialized variables.
      • rec_row.c:157:26: error: implicit conversion loses integer precision: 'u_int' (aka 'unsigned int') to
        'uint8_t' (aka 'unsigned char') [-Werror,-Wconversion]
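      The class of warning being fixed can be reproduced and silenced as below (a hypothetical
      example; the actual rec_row.c code differs). With -Wconversion, an implicit narrowing
      assignment is an error, so the cast is made explicit once the value is known to fit:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Narrow a wider unsigned value into a uint8_t explicitly. Callers must
       * guarantee the value is <= UINT8_MAX; the cast documents the narrowing
       * and satisfies -Wconversion.
       */
      static uint8_t
      narrow_prefix(unsigned int prefix)
      {
          return ((uint8_t)prefix);
      }
      ```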
      • Reset prefix_count (the count of the elements in the prefix group) on every fully-instantiated
        key, not just when the current prefix group is the largest; otherwise we could choose the wrong
        prefix group.

      Set prefix_start (the starting slot of the prefix group) on every fully-instantiated key: if
      the starting key is followed by an overflow key, (slot - 1) at the first prefix slot might not
      match the starting key.

      • Overflow records can be included in prefix groups; skip them during key instantiation.

      Don't call __wt_buf_grow and assume the WT_ITEM.data field will reference allocated memory.
      If the WT_ITEM.data field points to memory outside the buffer and we don't grow the buffer
      (resetting the WT_ITEM.data field), we'll overwrite that memory.

      • Add a new configuration item, btree.prefix, which sets a common prefix on all keys of between
        15 and 80 bytes. The btree.prefix configuration string also sets the btree.prefix_compression
        configuration, if it's not explicitly configured.

      Fix a bug in general format configuration: the code assumed the library "prefix_compression"
      default value was true, but it's actually false, which meant prefix compression was not being
      tested in most cases.

      Don't do any key generation work unless it's a row-store, that simplifies things. (I think it's
      safe, but I'm not sure, column-store is currently broken so it's hard to test. I've added an
      assert, so if I'm wrong it should be easy to find.)

      • It's too tricky to track increasing/decreasing key prefix compression, ignore until it's an
        issue somewhere.
      • Restructure in-memory key/value encoding to incorporate key prefix compression. The previous
        encoding supported 512B keys and 8KB values on 1MB pages; the new encoding supports 4KB keys
        plus key prefixes and 8KB values, but decreases the maximum supported page size to 128KB.

      Change __wt_row_leaf_key_info() to return a key prefix when returning a physical key.

      Change __wt_row_leaf_key_info() to not return a value. The calling convention is we get back a
      reference to a cell, a reference to a WT_IKEY structure with an instantiated key plus a cell,
      or a key, length and prefix triplet. This code is both ugly and confusing, but it needs to be
      a fast code path.

      Merge the functions that set the row-store WT_ROW reference when a page is read
      (__wt_row_leaf_key_set() and __wt_row_leaf_key_set_cell()), into a single call by checking for
      overflow keys in a common function rather than in the page-read function.

      Change __wt_row_leaf_key_work() to handle on-page encodings that include prefixes. Where the
      key can be immediately built from the page-wide prefix we calculated when the page was read,
      do that. Otherwise, the prefix has to be rolled-forward or backward exactly like a prefix read
      from a cell.
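      Rolling a prefix forward might be sketched like this (hypothetical names and a simplified cell
      shape): each successive cell's (prefix length, suffix) pair is applied to the previously built
      key in place, keeping the shared bytes and appending the suffix.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Simplified stand-in for an unpacked prefix-compressed key cell. */
      struct cell_sketch {
          size_t prefix;      /* bytes shared with the previous key */
          const char *suffix; /* bytes unique to this key */
          size_t suffix_len;
      };

      /*
       * roll_forward --
       *     Starting from a fully built key, apply each following cell's
       *     prefix/suffix in order to reconstruct the final key in place.
       *     Returns the resulting key size.
       */
      static size_t
      roll_forward(
          uint8_t *key, size_t key_size, const struct cell_sketch *cells, size_t ncells)
      {
          size_t i;

          for (i = 0; i < ncells; ++i) {
              /* Keep cells[i].prefix bytes of the current key, append the suffix. */
              memcpy(key + cells[i].prefix, cells[i].suffix, cells[i].suffix_len);
              key_size = cells[i].prefix + cells[i].suffix_len;
          }
          return (key_size);
      }
      ```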

      Change __cursor_row_slot_key_return() to handle prefix-encoded keys.

      Change row-store leaf page reconciliation to handle prefix-encoded keys.

      Minor changes along the way:

      Change __cursor_row_slot_key_return() to remove unused variable kpack_used.

      Change __wt_row_leaf_key_work() to not crack the key cell multiple times in the case of an
      overflow key: it won't change under us, so there's no reason to.

      Push the function to discard instantiated keys down into the btree_inline code to minimize
      information in the upper layers of the btree code.

      Don't initialize the time window in __wt_read_row_time_window() if it's going to be overwritten.

      Rename __wt_row_leaf_value_exists() to __wt_row_leaf_value_is_encoded(): the test isn't whether
      a value exists, it's a fast-path test for when the value is encoded and we aren't going to crack
      the cell to find the time window.

      • Reformat the comments to match 100 column windows.
      • Remove cleanup reminder.
      • No caller of __wt_row_leaf_value_cell() ever passes in a non-NULL kpack argument, remove it.

      Clarify the relationship between __wt_row_leaf_value() and __wt_row_leaf_value_cell().

      • Attempt to avoid key_data/key_size/key_prefix uninitialized warnings.
      • At some point, all uses of the "tmp" scratch buffer were removed; the variable and its error
        handling are no longer needed, so remove them.
      • Assert we never attempt to copy more data than will fit into the buffer, it turns a nasty
        buffer overflow bug into an assert.
      • Restructure the in-memory key/value encoding to make it possible to retrieve the on-page cell
        from all key encodings. The problem this solves is we have to be able to find the on-page key
        cell after instantiating a key (for example, to step past it to the value cell), which implies
        knowing the cell location when instantiating a key. This wasn't previously a problem because only
        references to on-page cells were instantiated, that is, the cell location was always available
        during the instantiation step. Adding key prefixes into the key encoding scheme changed that:
        we might want to instantiate a key-prefix compressed key, and by encoding the key, we've given
        up the on-page cell location. The trick is to always include a reference to the on-page cell
        in each encoding and store additional offsets to other pieces of the cell.

      The change is wide-ranging because callers of the underlying key retrieval functions used the
      fact of the cell being available in an encoding to indicate information about the encoding,
      and this is no longer the case. There are additional changes to take advantage of earlier
      changes that allow us to more quickly create full keys from prefix-compressed keys, based on
      a page-wide, known prefix.
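      As an illustration only (the real WiredTiger bit layout differs), an encoding that keeps the
      on-page cell recoverable in every case might pack the cell offset plus the key's
      offset-from-cell, size and prefix length into a single 64-bit word, with flag bits selecting
      the encoding:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Hypothetical layout (64 bits):
       *   [cell offset: 20][key offset from cell: 10][key size: 13][prefix: 13][flags: 8]
       * The cell offset is present in every encoding, so the on-page cell can
       * always be found again after a key has been instantiated.
       */
      #define ENC_FLAG_KEY 0x01u

      static uint64_t
      enc_pack(uint32_t cell_off, uint32_t key_off, uint32_t key_size, uint32_t prefix)
      {
          return ((uint64_t)cell_off << 44 | (uint64_t)key_off << 34 |
              (uint64_t)key_size << 21 | (uint64_t)prefix << 8 | ENC_FLAG_KEY);
      }

      static uint32_t enc_cell_off(uint64_t v) { return ((uint32_t)(v >> 44) & 0xFFFFF); }
      static uint32_t enc_key_off(uint64_t v) { return ((uint32_t)(v >> 34) & 0x3FF); }
      static uint32_t enc_key_size(uint64_t v) { return ((uint32_t)(v >> 21) & 0x1FFF); }
      static uint32_t enc_prefix(uint64_t v) { return ((uint32_t)(v >> 8) & 0x1FFF); }
      ```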

      • conversion from 'uintptr_t' to 'uint8_t', possible loss of data
      • Fix the comments to make sure nobody removes the buffer truncation again.
      • Review and fix some comments.
      • Declaration ordering.
      • Remove a comment that's no longer needed.
      • Update a comment.
      • Update a comment to explain why the group-size and group-prefix information is ignored.
      • Use spaces/tabs consistently so the diff looks cleaner.
      • Review comments, use a single comment for readability.
      • Fix a comment.
      • Fix a comment for consistency.
      • Split the key space into power-of-2 buckets: that results in tiny runs of prefix strings at
        the beginning of the tree, and increasingly large common prefixes as the tree grows (with a
        testing sweet spot in the middle).
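      The bucketing might look like the sketch below (my own illustration, not test/format's actual
      code): key N falls in bucket floor(log2(N)), buckets double in width as N grows, and keys in a
      bucket share a prefix, so common-prefix runs lengthen as the tree grows.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * key_bucket --
       *     Return floor(log2(keyno)): the power-of-2 bucket a key number
       *     falls in. Bucket b covers keys [2^b, 2^(b+1)), so run lengths
       *     double from one bucket to the next.
       */
      static uint32_t
      key_bucket(uint64_t keyno)
      {
          uint32_t bucket;

          for (bucket = 0; keyno > 1; keyno >>= 1)
              ++bucket;
          return (bucket);
      }
      ```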
      • Clean up a comment.
      • Variable declaration ordering.
      • Remove an unnecessary comment.
      • In the fast path for building a key from previously built keys, take into account the
        possibility that the key we most recently built was an overflow key.
      • Add zero-length values to the key/value in-memory encoding so we don't have to decode cells in
        order to instantiate zero-length values.

      Fix some uses of __wt_buf_grow() where there isn't any reason to consider the existing buffer
      contents: we're overwriting the entire buffer, not just part of it.

      When growing a buffer, take any existing data offset into account when deciding if more memory
      needs to be allocated, that avoids complex calculations by the caller to figure out if there's
      enough memory to append to an existing buffer where the data pointer is offset from the start
      of the allocated memory.
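      The offset-aware grow can be sketched as below (a hypothetical struct loosely modeled on
      WT_ITEM; not the real __wt_buf_grow): the grow decision accounts for the data pointer's offset
      into the allocation, and the data pointer is re-pointed after any reallocation.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical buffer where data may point inside the allocation. */
      struct buf_sketch {
          uint8_t *mem;   /* allocated memory */
          size_t memsize; /* allocation size */
          uint8_t *data;  /* may point past the start of mem */
          size_t size;    /* bytes of valid data at data */
      };

      /*
       * buf_grow --
       *     Ensure there are at least "need" bytes available at buf->data,
       *     accounting for any offset of data into the allocation, so callers
       *     can append without recomputing the offset themselves.
       */
      static int
      buf_grow(struct buf_sketch *buf, size_t need)
      {
          size_t offset;
          uint8_t *p;

          offset = (buf->mem != NULL && buf->data >= buf->mem) ?
              (size_t)(buf->data - buf->mem) : 0;

          if (offset + need > buf->memsize) {
              if ((p = realloc(buf->mem, offset + need)) == NULL)
                  return (-1);
              buf->mem = p;
              buf->memsize = offset + need;
          }
          buf->data = buf->mem + offset; /* re-point data after any realloc */
          return (0);
      }
      ```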

      Fix two places where WT_PTRDIFF(end, begin) arguments were flipped so we never encoded the
      key/value pair information, instead we always decoded the on-page cells.

      • We can race with reconciliation deleting overflow keys. Deleted overflow keys must be
        instantiated before deletion, so acquire the overflow lock and check: if the key has been
        deleted, restart the slot and get the instantiated key; otherwise, read the key before
        releasing the lock.
      • Remove the zero-length value encoding: it's slightly slower than the previous code that checks
        the following cell's type byte and fast-paths when it's a key. | 12 May 21 05:49 UTC

      Haribabu Kommi (haribabu.kommi@mongodb.com)
      Xgen-Evergreen-User (xgen-evg-user)