Type: Build Failure
Priority: Major - P3
Affects Version/s: None
- When a row-store page is read into memory, we find the largest group of prefix-compressed keys we
can build from a previous key on the page plus the per-key suffix bytes, and save the start/stop
information for the group in the WT_PAGE structure. It doesn't cost anything when the page is
read into memory (it's just updating a few local variables in the main loop). It costs 8B in
the WT_PAGE structure (80B to 88B), plus moving the page's read generation out of the "hot"
chunk of the WT_PAGE structure.
When we look up a key on a leaf page, if the key is part of the page's group of prefix-compressed
keys, we unpack the key's on-page cell. If the key's prefix is entirely found in the group's
starting key, we copy the key's prefix and suffix into the caller's buffer and don't instantiate
anything in memory. The cost here is (1) unpacking an on-page cell and (2) doing two memory
copies into the caller's buffer every time we search for a key in the prefix group. Previously,
we built the key once and instantiated it in allocated memory on the first search, and subsequent
searches would point the caller's buffer at that instantiated memory instead of doing any cell
unpacking or memory copies. Obviously, this is a real cost, on second and subsequent searches
we're trading work to avoid the memory amplification. It wouldn't be difficult to detect a page
being repeatedly searched if we wanted to do that, and instantiating these keys in that case,
which would focus this change on point reads/updates vs. hot-cache workloads.
When we reconcile a page, we don't increase the key prefix compression if the previous key was also
prefix compressed and we don't gain at least 10 bytes over the previous key's compression. In
other words, if we had a key prefix, and the next key's prefix isn't at least 10B longer,
we use the same prefix for both keys. In my smoke tests, common strings plus appended record
numbers generate short groups of keys with a common prefix without this change, and with this
change they generate long groups of keys with a common prefix. Obviously, this is a real cost
as well, although the block compression of the page should pick up the common strings we are
choosing not to remove as part of prefix-compression.
- Compiler complaints about uninitialized variables.
- rec_row.c:157:26: error: implicit conversion loses integer precision: 'u_int' (aka 'unsigned int') to
'uint8_t' (aka 'unsigned char') [-Werror,-Wconversion]
- Reset prefix_count (the count of the elements in the prefix group) on every fully-instantiated
key, not just if the current prefix group is the largest, otherwise we could choose the wrong
Set prefix_start (the starting slot of the prefix group), on every fully-instantiated key, if
the starting key is followed by an overflow key, (slot - 1) on the first prefix slot might not
match the starting key.
- Overflow records can be included in prefix groups, skip them during key instantiation.
Don't call __wt_buf_grow and assume the WT_ITEM.data field will reference allocated memory.
If the WT_ITEM.data field points to memory outside the buffer and we don't grow the buffer
(resetting the WT_ITEM.data field), we'll overwrite that memory.
- Add a new configuration item, btree.prefix, which sets a common prefix on all keys of between
15 and 80 bytes. The btree.prefix configuration string also sets the btree.prefix_compression
configuration, if it's not explicitly configured.
Fix a bug in general format configuration: the code assumed the library "prefix_compression"
default value was true, and it's actually false, which meant prefix compression is not being
tested in most cases.
Don't do any key generation work unless it's a row-store, that simplifies things. (I think it's
safe, but I'm not sure, column-store is currently broken so it's hard to test. I've added an
assert, so if I'm wrong it should be easy to find.)
- It's too tricky to track increasing/decreasing key prefix compression, ignore until it's an
- Restructure in-memory key/value encoding to incorporate key prefix compression. The previous
encoding supported 512B keys and 8KB values on 1MB pages; the new encoding supports 4KB keys
plus key prefixes and 8KB values, but decreases the maximum supported page size to 128KB.
Change __wt_row_leaf_key_info() to return a key prefix when returning a physical key.
Change __wt_row_leaf_key_info() to not return a value. The calling convention is we get back a
reference to a cell, a reference to a WT_IKEY structure with an instantiated key plus a cell,
or a key, length and prefix triplet. This code is both ugly and confusing, but it needs to be
a fast code path.
Merge the functions that set the row-store WT_ROW reference when a page is read
(__wt_row_leaf_key_set() and __wt_row_leaf_key_set_cell()), into a single call by checking for
overflow keys in a common function rather than in the page-read function.
Change __wt_row_leaf_key_work() to handle on-page encodings that include prefixes. Where the
key can be immediately built from the page-wide prefix we calculated when the page was read,
do that. Otherwise, the prefix has to be rolled-forward or backward exactly like a prefix read
from a cell.
Change __cursor_row_slot_key_return() to handle prefix-encoded keys.
Change row-store leaf page reconciliation to handle prefix-encoded keys.
Minor changes along the way:
Change __cursor_row_slot_key_return() to remove unused variable kpack_used.
Change __wt_row_leaf_key_work() to not crack the key cell multiple times in the case of an
overflow key, it won't change under us, there's no reason.
Push the function to discard instantiated keys down into the btree_inline code to minimize
information in the upper layers of the btree code.
Don't initialize the time window in __wt_read_row_time_window() if it's going to be overwritten.
Rename __wt_row_leaf_value_exists() to __wt_row_leaf_value_is_encoded(), the test isn't if a
value exists, it's a fast path test for when the value is encoded and we aren't going to crack
the cell to find the time window.
- Reformat the comments to match 100 column windows.
- Remove cleanup reminder.
- No caller of __wt_row_leaf_value_cell() ever passes in a non-NULL kpack argument, remove it.
Clarify the relationship between __wt_row_leaf_value() and __wt_row_leaf_value_cell().
- Attempt to avoid key_data/key_size/key_prefix uninitialized warnings.
- At some point, all uses of the "tmp" scratch buffer were removed, the variable and error handling
are no longer needed, remove them.
- Assert we never attempt to copy more data than will fit into the buffer, it turns a nasty
buffer overflow bug into an assert.
- Restructure the in-memory key/value encoding to make it possible to retrieve the on-page cell
from all key encodings. The problem this solves is we have to be able to find the on-page key
cell after instantiating a key (for example, to step past it to the value cell), which implies
knowing the cell location when instantiating a key. This wasn't previously a problem because only
references to on-page cells were instantiated, that is, the cell location was always available
during the instantiation step. Adding key prefixes into the key encoding scheme changed that:
we might want to instantiate a key-prefix compressed key, and by encoding the key, we've given
up the on-page cell location. The trick is to always include a reference to the on-page cell
in each encoding and store additional offsets to other pieces of the cell.
The change is wide-ranging because callers of the underlying key retrieval functions used the
fact of the cell being available in an encoding to indicate information about the encoding,
and this is no longer the case. There are additional changes to take advantage of earlier
changes that allow us to more quickly create full keys from prefix-compressed keys, based on
a page-wide, known prefix.
- conversion from 'uintptr_t' to 'uint8_t', possible loss of data
- Fix the comments to make sure nobody removes the buffer truncation again.
- Review and fix some comments.
- declaration ordering.
- Remove a comment that's no longer needed.
- Update a comment.
- Update a comment to explain why the group-size and group-prefix information is ignored.
- Use spaces/tabs consistently so the diff looks cleaner.
- Review comments, use a single comment for readability.
- Fix a comment.
- Fix a comment for consistency.
- Split the key space into power-of-2 buckets: that results in tiny runs of prefix strings at
the beginning of the tree, and increasingly large common prefixes as the tree grows (with a
testing sweet spot in the middle).
- Clean up a comment.
- Variable declaration ordering.
- Remove an unnecessary comment.
- In the fast-path for building key from previously built keys, take into account the possibility
that the key we most recently built was an overflow key.
- Add zero-length values to the key/value in-memory encoding so we don't have to decode cells in
order to instantiate zero-length values.
Fix some uses of __wt_buf_grow() where there isn't any reason to consider any existing buffer
contents, we're overwriting the entire buffer, not just part of it.
When growing a buffer, take any existing data offset into account when deciding if more memory
needs to be allocated, that avoids complex calculations by the caller to figure out if there's
enough memory to append to an existing buffer where the data pointer is offset from the start
of the allocated memory.
Fix two places where WT_PTRDIFF(end, begin) arguments were flipped so we never encoded the
key/value pair information, instead we always decoded the on-page cells.
- We can race with reconciliation deleting overflow keys. Deleted overflow keys must be instantiated
before deletion, acquire the overflow lock and check. If the key has been deleted, restart the
slot and get the instantiated key, else read the key while before releasing the lock.
- Remove the zero-length value encoding, it's slightly slower than the previous code that checks the
following cell's type byte and fast-paths when it's a key. | 12 May 21 05:49 UTC
Evergreen Subscription: ; Evergreen Event: