We need to review how we define page sizes in WiredTiger, and write some documentation on it for tuning.
Here's the situation right now – you can specify 5 things:
- a minimum page size (for both leaf & internal pages),
- a maximum page size (for both leaf & internal pages), and
- an allocation unit, that is, the block size we use for allocating from the underlying file.
Here's what these 5 sizes do:
- The allocation unit is the smallest piece we'll allocate from the underlying file. So, when you pick an allocation unit, you're setting how big the file can get, and you're deciding how much space gets wasted, on average, for an overflow item. The default is a 512B allocation unit, and since we use 32-bit block offsets, you can create a file up to 2TB (2^9 * 2^32 = 2^41 bytes). The maximum allocation unit is 512MB, which allows files up to 2EB (2^29 * 2^32 = 2^61 bytes). Obviously, with 512MB allocation units, any overflow item you did have could waste a big chunk of space.
- The maximum page sizes tell us when we're going to split during page reconciliation. When an in-memory page is reconciled, we allocate a chunk of memory of the maximum page size and reconcile the in-memory page into it. When that chunk fills up, the page splits, and a new page is inserted into the tree.
- The minimum page size is largely unused; the only thing we use it for is to figure out the overflow size. It's easier to show you that code than to explain it:
In other words, we take the minimum page sizes, hit them with a guess at how deep we want a tree to go, and that determines the overflow size.
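To put numbers on the allocation-unit trade-off described above, here is a rough sketch of the arithmetic. The function names are made up for illustration, and it assumes an overflow item simply occupies a whole number of allocation units, ignoring any per-block header the real block manager may add.

```python
OFFSET_BITS = 32  # block offsets are 32-bit, as stated above


def max_file_size(allocation_unit: int) -> int:
    """Largest file addressable with 32-bit block offsets, in bytes."""
    return allocation_unit * (1 << OFFSET_BITS)


def wasted_bytes(item_size: int, allocation_unit: int) -> int:
    """Bytes left unused when an item is rounded up to whole allocation units.

    Assumption: the item occupies whole allocation units and nothing else.
    """
    blocks = -(-item_size // allocation_unit)  # ceiling division
    return blocks * allocation_unit - item_size


if __name__ == "__main__":
    KB, MB, TB, EB = 1 << 10, 1 << 20, 1 << 40, 1 << 60
    print(max_file_size(512) // TB)         # 2 (TB) with the 512B default
    print(max_file_size(512 * MB) // EB)    # 2 (EB) with the 512MB maximum
    print(wasted_bytes(10 * KB + 1, 512))   # 511 bytes wasted with 512B units
    print(wasted_bytes(10 * KB + 1, 512 * MB))  # nearly the whole 512MB unit
```

The last two lines show the same overflow item one byte past 10KB: with 512B units it wastes at most one unit minus a byte, with 512MB units it wastes almost the entire unit.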
I've been thinking it would be better to replace the minimum page sizes with explicit overflow sizes. I think that will be easier to talk about and understand for tuning purposes.
In that design, here are the 5 knobs and what they mean:
1. allocation unit: the unit of allocation from the file; if you keep it small, the maximum file size is limited, but you don't waste as much room on overflow items (unchanged from before)
2. maximum leaf page size: the size at which we split leaf pages, that is, no leaf page grows larger than this (unchanged)
3. maximum internal page size: the size at which we split internal pages, that is, no internal page grows larger than this (unchanged)
4. internal overflow size: any key that's larger than this size gets stored as an overflow item
5. leaf overflow size: any key or data item that's larger than this size gets stored as an overflow item
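For knobs 2 and 3, a toy model of "split when the maximum page size is reached" may help when writing the tuning docs. This is only a sketch: `reconcile` here is a hypothetical stand-in, not the real reconciliation code, and it ignores page headers, cell encoding, and however the real code chooses its split points.

```python
from typing import List


def reconcile(items: List[bytes], page_max: int) -> List[List[bytes]]:
    """Pack items, in order, into pages that never exceed page_max bytes.

    Toy model: a page is just a list of items and its size is the sum of
    the item lengths. Starting a new list stands in for a page split.
    """
    pages: List[List[bytes]] = [[]]
    used = 0
    for item in items:
        if len(item) > page_max:
            # In the real system this is where overflow items come in.
            raise ValueError("item is larger than a page")
        if used + len(item) > page_max:
            pages.append([])  # the current page is full: split
            used = 0
        pages[-1].append(item)
        used += len(item)
    return pages


if __name__ == "__main__":
    items = [b"x" * 300 for _ in range(4)]
    pages = reconcile(items, page_max=1024)
    print([len(p) for p in pages])  # [3, 1]
```

A larger maximum page size means fewer, bigger pages for the same data; a smaller one means more splits.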
We'll leave the code that figures out an overflow item size as it currently is; if the application doesn't specify an overflow item size, that code will give us a number to use.
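Roughly, the proposed knobs and the fallback could be modeled like this. All names below are hypothetical, not actual WiredTiger configuration keys, and the derived default is passed in as a plain number because the existing calculation is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageSizes:
    """The five proposed knobs; field names are illustrative only."""
    allocation_unit: int
    leaf_page_max: int
    internal_page_max: int
    internal_overflow: Optional[int] = None  # None: application did not set it
    leaf_overflow: Optional[int] = None      # None: application did not set it


def effective_overflow(explicit: Optional[int], derived: int) -> int:
    """Use the application's overflow size if given, else the derived one."""
    return explicit if explicit is not None else derived


def is_overflow(item_len: int, overflow_size: int) -> bool:
    """An item strictly larger than the overflow size is stored as overflow."""
    return item_len > overflow_size


if __name__ == "__main__":
    cfg = PageSizes(allocation_unit=512, leaf_page_max=32768,
                    internal_page_max=4096, leaf_overflow=2048)
    # 1000 stands in for whatever the existing code would compute.
    leaf_limit = effective_overflow(cfg.leaf_overflow, derived=1000)
    intl_limit = effective_overflow(cfg.internal_overflow, derived=1000)
    print(leaf_limit, intl_limit)         # 2048 1000
    print(is_overflow(2049, leaf_limit))  # True
    print(is_overflow(2048, leaf_limit))  # False
```

The point of the sketch is that an explicit setting always wins, and the existing calculation only supplies a number when the application leaves the knob unset.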