Major - P3
WiredTiger doesn't include an interface to return row- or byte-count information for objects or key ranges. This is generally useful, for example when choosing a join order. Recent discussions with redbeard0531 and michael.cahill also described other potential uses for this functionality.
The referenced branch (linked here) implements an approach to adding this functionality.
The first part of the change is to add leaf page row- and byte-counts to each address cookie stored in the internal pages during reconciliation, and those counts are then subsequently aggregated into each new internal page to the root. At checkpoint, the counts are stored with the metadata information for the object.
The second part of the change is a new API:
The API is similar to WT_SESSION.truncate, taking either the name of an object or a pair of cursors that specify a range within the object. If the name of the object is specified, the row- and byte-counts are returned from the most recent checkpoint of the object. These row- and byte-counts are exact as of that checkpoint and won't change until another checkpoint is performed.
If a pair of cursors are specified, an estimate of the row- and byte-counts for the cursor range is returned. A NULL cursor implies the beginning or end of the tree. The cursors don't have to be positioned, calling WT_CURSOR.set_key is sufficient.
The algorithm for estimating the cursor range information is to descend the tree to find the first internal page where the cursors diverge into different sub-trees. At that point, the row- and byte-counts between the two cursor positions on that page are aggregated and returned. This is an estimate of the values, of course, as it assumes a balanced tree with uniform distribution of rows and bytes. Leaf pages are ignored: cursors that don't diverge until reaching a leaf page, are simply assumed to split the range in half.
- The byte-count is currently the in-memory size requirements for the page, that is, it's the value saved when a page is written, as the amount of memory needed to bring the page back into memory. There's nothing special about that choice, it could be changed, for example, it could include the overflow key/value memory requirements of the subtree. Or, there could be separate values for in-memory and on-disk requirements for the subtree.
- The branch doesn't include any upgrade/downgrade support. As this work changes the address cookie stored on internal pages, this is a page-format change and requires upgrade/downgrade support. Additionally, keith.smith and donald.anderson have discussed changes to the address cookie as part of their work in tiered storage. We'd have to update the Btree page version number and modify this code to support both old and new pages: it's not difficult and I ignored it so I could easily assert every address cookie had the new information. Downgrade could be handled by adding code to previous releases that recognized the new page version and handled the new style of address cookies. (I anticipate the code added to older releases would be relatively simple to add and test, on the order of 20 lines of code in the block manager.)
- There is no performance cost to this change: the concern is row-store internal page reconciliation must decode on-page cells when incorporating them into a new version of the page. As the current internal page reconciliation code already decodes all of the cells to acquire time window aggregation information, there is little additional cost to this change.
- The performance of the new API is roughly the same as a random tree search, as it's simply two binary searches of some small number of internal pages from the tree, some of which are certain to be in cache already.
- The branch includes partial column-store support. While the row count isn't useful for a column store object, the byte count is as useful as it is in row store. I would suggest that none of this information be supported for fixed-length column store, and variable-length column store be handled identically to row store.
- The change is reasonably small: it's about 1500 LOC, evenly divided between the new API and the code to support changing the address cookies and aggregating them up the tree.
- The Python test is just a smoke-test: to make it real, there needs to be way for the WT_SESSION.range_stat Python interface to return multiple values.
- The branch requires a new copy of the tree search function, but doesn't re-factor the existing search functions to share code. That might be worth doing, but I'd be hesitant to do that, the existing search functions are special-purpose and relatively carefully tuned.