State chart for fast-truncate in WT-9252
----------------------------------------

State M: an ordinary in-memory page.
   Accessible read-only: true
   Ref state: WT_REF_MEM
   Parent address cell: ~anything
   Parent eviction: not permitted
   ref->ft_info: always null
   ref->page->modify: may exist
   Leaf page dirty: maybe
   Notes:
      In-memory pages can't be fast-truncated.

   - On leaf eviction, can move to state D.
        (That is, if the page has no overflow items.)

State D: a page that can be truncated.
   Accessible read-only: true (though truncates aren't possible)
   Ref state: WT_REF_DISK (on disk)
   Parent address cell: WT_CELL_ADDR_LEAF_NO
   Parent eviction: permitted
   ref->ft_info: always null
   ref->page->modify: does not exist
   Notes:
      The page contains no overflow items.

   - On visit, moves back to state M.
        The ref state is changed to WT_REF_MEM.
   - On truncate: moves to state UU.
        The ref state is changed to WT_REF_DELETED.
        ref->ft_info.del is populated with a page_del structure.

State E: an empty page placeholder.
   Accessible read-only: true
   Ref state: WT_REF_DELETED
   Parent address cell: none (no address)
   Parent eviction: permitted
   ref->ft_info: always null
   ref->page->modify: does not exist
   Notes:
      New trees are created with a single leaf page in state E.
      Checkpoint cleanup can also create these pages.

   - On visit, moves to state M.
        A fresh page is created.
        The ref state changes to WT_REF_MEM.
   - On parent eviction, moves to state X.
        __wt_rec_child_modify returns WT_CHILD_IGNORE so no cell is written.
        When the parent page is discarded, the ref goes with it.
   - On parent checkpoint, remains in state E.
        __wt_rec_child_modify returns WT_CHILD_IGNORE so no cell is written.
        However, the in-memory state is unchanged.
        The ref state remains WT_REF_DELETED.

State X: a page that no longer exists at all.
   Notes:
      None.

State UU: an uncommitted, uninstantiated truncate.
   Accessible read-only: false
   Ref state: WT_REF_DELETED (on disk)
   Parent address cell: WT_CELL_ADDR_LEAF_NO
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.del is valid and not NULL
   ref->page->modify: does not exist
   Notes:
      None.

   - On visit: instantiates and moves to state UI.
        The page is loaded into memory and tombstones are attached.
        The ref state is changed to WT_REF_MEM.
        The list of tombstones is stored in ref->ft_info.update.
        ref->page->modify is created.
        modify->instantiated is set to true.
        The page_del structure is moved to modify->page_del.
        The page is marked dirty.
   - On parent checkpoint: remains in state UU.
        The truncation is not visible to the checkpoint.
        __wt_rec_child_modify returns WT_CHILD_ORIGINAL.
        The existing leaf page address is used in a WT_CELL_ADDR_LEAF_NO cell.
   - On transaction prepare: remains in state UU.
        The material in the page_del structure is changed accordingly.
        (Note that the parent page still cannot be evicted.)
   - On transaction abort: moves back to state D.
        ref->ft_info.del is zapped.
        The ref state goes back to WT_REF_DISK.
        The page is now perfectly ordinary.
   - On transaction commit: moves to state CU.
        ref->ft_info.del is updated accordingly.

State UI: an uncommitted, instantiated but unreconciled truncate.
   Accessible read-only: false
   Ref state: WT_REF_MEM
   Parent address cell: WT_CELL_ADDR_LEAF_NO
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.update is valid
   ref->page->modify: exists
   ref->page->instantiated: true
   ref->page->page_del: non-NULL
   Leaf page dirty: true
   Notes:
      modify->page_del holds the page_del structure in case internal
      page reconciliation needs it.
      Every value has a tombstone matching the truncation.
      The tombstones are listed in ref->ft_info.update.

   - On leaf eviction: will fail because of the uncommitted updates.
   - On leaf checkpoint: moves to state UR.
        The page will get written out without the tombstones.
        The tombstones remain.
        modify->instantiated is set to false.
        modify->page_del is discarded.
   - On parent checkpoint (without leaf checkpoint): stays in state UI.
        According to modify->page_del, the deletion isn't visible.
        We write out a (non-deleted) reference to the existing child page.
        The internal page is left dirty.
   - On transaction prepare: remains in state UI.
        The tombstones in ref->ft_info.update are updated accordingly.
        We also update modify->page_del.
        The ref state stays WT_REF_MEM.
   - On transaction abort: we move to state M.
        The tombstones in ref->ft_info.update are marked aborted.
        ref->ft_info.update is discarded.
        We also clear modify->instantiated and modify->page_del.
        The ref state stays WT_REF_MEM.
        The page is now perfectly ordinary.
   - On transaction commit: we move to state CI.
        The tombstones in ref->ft_info.update are updated with the commit.
        We also update modify->page_del.
        The ref state stays WT_REF_MEM.

State UR: an uncommitted, instantiated and reconciled truncate.
   Accessible read-only: false
   Ref state: WT_REF_MEM
   Parent address cell: WT_CELL_ADDR_LEAF_NO
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.update is valid
   ref->page->modify: exists
   ref->page->instantiated: false
   ref->page->page_del: NULL
   Leaf page dirty: true
   Notes:
      Every value has a tombstone matching the truncation.
      The tombstones are listed in ref->ft_info.update.

   - On leaf eviction: will fail because of the uncommitted updates.
   - On leaf checkpoint: remains in state UR.
        If a new disjoint value has been inserted and committed, the page
        will get written out without the tombstones.
        Otherwise, the previous page image will be reused.
        The tombstones remain.
   - On parent checkpoint (without leaf checkpoint): stays in state UR.
        We write out a (non-deleted) reference to the existing child page.
   - On transaction prepare: remains in state UR.
        The tombstones in ref->ft_info.update are updated accordingly.
        The ref state stays WT_REF_MEM.
        We also update modify->page_del.
   - On transaction abort: we move to state M.
        The tombstones in ref->ft_info.update are marked aborted.
        ref->ft_info.update is discarded.
        We clear modify->instantiated and modify->page_del.
        The ref state stays WT_REF_MEM.
        The page is now perfectly ordinary.
   - On transaction commit: we move to state M.
        The tombstones in ref->ft_info.update are updated with the commit.
        ref->ft_info.update is discarded.
        We also update modify->page_del.
        The ref state stays WT_REF_MEM.
        The page is now a perfectly ordinary modified page.

State CU: a committed, uninstantiated truncate.
   Accessible read-only: false
   Ref state: WT_REF_DELETED (on disk)
   Parent address cell: WT_CELL_ADDR_LEAF_NO
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.del is valid and not NULL
   ref->page->modify: does not exist
   Notes:
      Eviction of internal pages with committed but not globally visible
      truncates is prohibited, but I think this is a
      leftover/mistake/bug.

   - On visit: instantiates and moves to state CI.
        The page is loaded into memory and tombstones are attached.
        The ref state is changed to WT_REF_MEM.
        ref->page->modify is created.
        modify->instantiated is set to true.
        The page_del structure is moved to modify->page_del.
        ref->ft_info.update is set to NULL.
        The page is marked dirty if the tree is read-write.
   - On parent checkpoint: moves to state DU.
        The truncation is visible to the checkpoint.
        __wt_rec_child_modify returns WT_CHILD_PROXY.
        The existing leaf page address is used and a WT_CELL_DEL cell is
        created containing the info from the page_del structure.
        The ref state stays WT_REF_DELETED.
   - On the truncate becoming globally visible: we move to state VU.
        (This is checked at various points the page is handled.)
        ref->ft_info.del is discarded.
        The ref state stays WT_REF_DELETED.

State CI: a committed, instantiated but unreconciled truncate.
   Accessible read-only: false
   Ref state: WT_REF_MEM
   Parent address cell: WT_CELL_ADDR_LEAF_NO
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.update is NULL
   ref->page->modify: exists
   ref->page->instantiated: true
   ref->page->page_del: non-NULL
   Leaf page dirty: true
   Notes:
      modify->page_del holds the page_del structure in case internal
      page reconciliation needs it.
      Every value has a tombstone matching the truncation.
      The tombstones are _not_ listed in ref->ft_info.update.

   - On leaf eviction: moves to state D.
        The page will be written out with the tombstones.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state becomes WT_REF_DISK.
        The in-memory parent page is updated with a new address.
        The page becomes an ordinary on-disk page.
   - On leaf checkpoint: moves to state D.
        Same as eviction.
   - On parent checkpoint (without leaf checkpoint): moves to state DI.
        According to modify->page_del, the deletion is visible.
        However, because the leaf page hasn't ever been reconciled we
        can't refer to that version of it.
        We write out a WT_CELL_DEL reference to the existing on-disk page
        image, using the info in modify->page_del.
        (This is the case that specifically needs modify->page_del.)
        The ref state remains WT_REF_MEM.
   - On the truncate becoming globally visible: moves to state VI.
        This is not explicitly checked for and nothing overt happens.
        The ref state remains WT_REF_MEM.

State DU: a committed on disk, uninstantiated truncate.
   Accessible read-only: true
   Ref state: WT_REF_DELETED (on disk)
   Parent address cell: WT_CELL_DEL
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.del is valid and not NULL
   ref->page->modify: does not exist
   Notes:
      Eviction of internal pages with committed but not globally visible
      truncates is prohibited, but I think this is a
      leftover/mistake/bug.

   - On visit: instantiates and moves to state DI.
        The page is loaded into memory and tombstones are attached.
        The ref state is changed to WT_REF_MEM.
        ref->page->modify is created.
        modify->instantiated is set to true.
        The page_del structure is moved to modify->page_del.
        ref->ft_info.update is set to NULL.
        The page is marked dirty if the tree is read-write.
   - On parent eviction in a readonly tree, if we allowed it: goes to state F.
        Because the tree is readonly, the parent page is discarded.
        When the parent is discarded, the ref goes with it.
   - On parent eviction, if we allowed it: goes to state F.
        The truncation is visible to the checkpoint.
        __wt_rec_child_modify returns WT_CHILD_PROXY.
        The existing leaf page address is used and a WT_CELL_DEL cell is
        created containing the info from the page_del structure.
        When the parent is discarded, the ref goes with it.
   - On parent checkpoint: remains in state DU.
        The truncation is visible to the checkpoint.
        __wt_rec_child_modify returns WT_CHILD_PROXY.
        The existing leaf page address is used and a WT_CELL_DEL cell is
        created containing the info from the page_del structure.
        The ref state remains WT_REF_DELETED.
   - On the truncate becoming globally visible: we move to state VU.
        (This is checked at various points the page is handled.)
        ref->ft_info.del is discarded.
        The ref state remains WT_REF_DELETED.

State DR: a committed on disk, uninstantiated truncate, in a read-only tree.
   Accessible read-only: true
   Ref state: WT_REF_DISK
   Parent address cell: WT_CELL_DEL
   Parent eviction: permitted
   ref->ft_info: is NULL
   ref->page->modify: does not exist
   Notes:
      This is the same as state DU, except for the ref state.

   - On visit: instantiates and moves to state DI.
        The page is loaded into memory and tombstones are attached.
        The ref state is changed to WT_REF_MEM.
        ref->page->modify is created.
        modify->instantiated is set to true.
        The page_del structure is moved to modify->page_del.
        ref->ft_info.update is set to NULL.
        The page is marked dirty if the tree is read-write.
        (Note that we detect this case by checking the cell type.)
   - On parent eviction, goes to state F.
        Because the tree is readonly, the parent page is discarded.
        When the parent is discarded, the ref goes with it.
   - On parent checkpoint: not possible because the tree is readonly.
   - On the truncate becoming globally visible: we move to state VU.
        (This is checked at various points the page is handled.)
        ref->ft_info.del is discarded.
        The ref state remains WT_REF_DELETED.
        Note that in principle this transition is not possible because
        the tree is readonly; however, if running readonly (as opposed to
        reading a checkpoint) it is still as far as I know possible to
        manipulate oldest and thereby change visibility.

State DI: a committed on disk, instantiated but unreconciled truncate.
   Accessible read-only: true
   Ref state: WT_REF_MEM
   Parent address cell: WT_CELL_DEL
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.update is NULL
   ref->page->modify: exists
   ref->page->instantiated: true
   ref->page->page_del: non-NULL
   Leaf page dirty: true except in readonly trees
   Notes:
      modify->page_del holds the page_del structure in case internal
      page reconciliation needs it.
      Every value has a tombstone matching the truncation.
      The tombstones are _not_ listed in ref->ft_info.update.

   - On leaf eviction in a readonly tree: moves to state DR.
        The page will be discarded.
        The ref state becomes WT_REF_DISK.
   - On leaf eviction in a read-write tree: moves to state D.
        The page will be written out with the tombstones.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state becomes WT_REF_DISK.
        The in-memory parent page is updated with a new address.
        The page becomes an ordinary on-disk page.
   - On leaf checkpoint: moves to state D.
        Same as eviction.
   - On parent checkpoint (without leaf checkpoint): stays in state DI.
        According to modify->page_del, the deletion is visible.
        However, because the leaf page hasn't ever been reconciled we
        can't refer to that version of it.
        We write out a WT_CELL_DEL reference to the existing on-disk page
        image, using the info in modify->page_del.
        (This is the case that specifically needs modify->page_del.)
        The ref state remains WT_REF_MEM.
   - On the truncate becoming globally visible: moves to state VI.
        This is not explicitly checked for and nothing overt happens.
        The ref state remains WT_REF_MEM.

State VU: a globally visible, uninstantiated truncate.
   Accessible read-only: true
   Ref state: WT_REF_DELETED (on disk)
   Parent address cell: WT_CELL_ADDR_LEAF_NO or WT_CELL_DEL
   Parent eviction: permitted
   ref->ft_info: ref->ft_info.del is NULL
   ref->page->modify: does not exist
   Notes:
      Because we need to carry around the original on-disk page's
      address until the parent internal page is reconciled, if the page
      is visited the choices are (a) instantiate it in the usual way
      even though the tombstones we're attaching will be globally
      visible, or (b) create a fresh page and pretend we loaded it from
      the original page image. The latter seemed like it had enough ways
      to go wrong that we chose (for now at least) not to do it.

   - On visit: instantiates and moves to state VI.
        The page is loaded into memory and tombstones are attached.
        The ref state is changed to WT_REF_MEM.
        ref->page->modify is created.
        modify->instantiated is set to true.
        The page_del structure is moved to modify->page_del.
        ref->ft_info.update is set to NULL.
        The page is marked dirty if the tree is read-write.
   - On parent eviction in a readonly tree: moves to state F.
        Because the tree is readonly, the parent page is discarded.
        When the parent page is discarded, the leaf ref is discarded
        along with it.
   - On parent eviction: moves to state X.
        The original on-disk image is now discarded.
        __wt_rec_child_modify returns WT_CHILD_IGNORE.
        Nothing is placed in the parent page image.
        When the parent page is discarded, the leaf ref is discarded
        along with it.
   - On parent checkpoint: moves to state E.
        The original on-disk image is now discarded.
        __wt_rec_child_modify returns WT_CHILD_IGNORE.
        Nothing is placed in the parent page image.
        The ref remains in state WT_REF_DELETED, and no longer has an
        address.

State VI: a globally visible, instantiated but unreconciled truncate.
   Accessible read-only: true
   Ref state: WT_REF_MEM
   Parent address cell: WT_CELL_ADDR_LEAF_NO or WT_CELL_DEL
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.update is NULL
   ref->page->modify: exists
   ref->page->instantiated: true
   ref->page->page_del: may be non-NULL
   Leaf page dirty: true except in readonly trees
   Notes:
      modify->page_del holds the page_del structure in case internal
      page reconciliation needs it.
      Every value has a tombstone matching the truncation.
      The tombstones are _not_ listed in ref->ft_info.update.

   - On leaf eviction in a readonly tree: moves to state VR.
        The page will be discarded.
        The ref state becomes WT_REF_DISK.
   - On leaf eviction: moves to state D.
        The page will be written out with the tombstones.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state becomes WT_REF_DISK.
        The in-memory parent page is updated with a new address.
        The page becomes an ordinary on-disk page.
   - On leaf checkpoint: moves to state M.
        The page will be written out with the tombstones.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state remains WT_REF_MEM.
        The in-memory parent page is updated with a new address.
        The page becomes an ordinary in-memory page.
   - On parent checkpoint (without leaf checkpoint): moves to state VE.
        The original on-disk image is now discarded.
        __wt_rec_child_modify returns WT_CHILD_IGNORE.
        Nothing is written to the new parent page image.
        The ref state remains WT_REF_MEM.

State VR: a globally visible, uninstantiated truncate, in a read-only tree.
   Accessible read-only: true
   Ref state: WT_REF_DISK
   Parent address cell: WT_CELL_DEL
   Parent eviction: permitted
   ref->ft_info: ref->ft_info.del is NULL
   ref->page->modify: does not exist
   Notes:
      This is the same as state VU, except for the ref state.

   - On visit: instantiates and moves to state VI.
        The page is loaded into memory and tombstones are attached.
        The ref state is changed to WT_REF_MEM.
        ref->page->modify is created.
        modify->instantiated is set to true.
        The page_del structure is moved to modify->page_del.
        ref->ft_info.update is set to NULL.
   - On parent eviction: moves to state F.
        Because the tree is readonly, the parent page is discarded.
        When the parent page is discarded, the leaf ref is discarded
        along with it.

State VE: a globally visible, instantiated but unreconciled truncate
          with no address.
   Accessible read-only: false
   Ref state: WT_REF_MEM
   Parent address cell: none (no address)
   Parent eviction: not permitted
   ref->ft_info: ref->ft_info.update is NULL
   ref->page->modify: exists
   ref->page->instantiated: true
   ref->page->page_del: may be non-NULL
   Leaf page dirty: true
   Notes:
      modify->page_del holds the page_del structure in case internal
      page reconciliation needs it.
      Every value has a tombstone matching the truncation.
      The tombstones are _not_ listed in ref->ft_info.update.

   - On leaf eviction given additional updates: moves to state D.
        The page will be written out with any new updates.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state becomes WT_REF_DISK.
        The in-memory parent page is updated with a new address.
        The page becomes an ordinary on-disk page.
   - On leaf eviction given no updates: moves to state E.
        The page will reconcile empty and not be written.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state becomes WT_REF_DELETED.
   - On leaf checkpoint given additional updates: moves to state M.
        The page will be written out with any new updates.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state remains WT_REF_MEM.
        The in-memory parent page is updated with a new address.
        The page becomes an ordinary in-memory page.
   - On leaf checkpoint given no updates: moves to state M.
        The page will reconcile empty and not be written.
        modify->instantiated is set to false.
        modify->page_del is discarded.
        The ref state remains WT_REF_MEM.
        The page becomes an ordinary in-memory page.
   - On parent checkpoint (without leaf checkpoint): impossible.
        This cannot happen, because we got here via a parent checkpoint
        with no leaf checkpoint, and there will necessarily be a leaf
        checkpoint in the next checkpoint.

State F: a truncation fully forgotten from memory.
   Ref state: n/a
   Parent address cell: WT_CELL_DEL
   Parent eviction: parent not in memory
   ref->ft_info: nonexistent
   Notes:
      Upon restart, or when reading a checkpoint from a checkpoint
      cursor, all deleted pages are in state F.

   - When the parent page is visited: moves to state DU.
        Loading the parent creates refs for its children.
        For WT_CELL_DEL pages, ref->ft_info.del is created from the
        information in the cell, and ref->state is set to WT_REF_DELETED.

Additional notes:

   - Checkpointing the internal parent page when the leaf page hasn't
     been checkpointed can happen if the leaf page is instantiated after
     the checkpoint starts. That is:
        1. checkpoint starts
        2. checkpoint tree walk goes past the leaf page
        3. leaf page is instantiated
        4. checkpoint reconciles the parent internal page

   - The DR and VR states are reached only when an instantiated page in
     a readonly tree is discarded by eviction.
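
Illustrative sketch (not WiredTiger code):

   The uninstantiated part of the chart can be written down as a small
   transition function, which may help when checking a proposed change
   against the chart. The sketch below covers only the transitions out of
   states D, UU, CU and DU as listed above; every other combination
   returns FT_INVALID. All names here (ft_state, ft_event, ft_next_state,
   ft_walk and the FT_ and EV_ constants) are hypothetical and invented
   for illustration. None of them exist in WiredTiger.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; these are not WiredTiger identifiers. */
typedef enum {
    FT_M, FT_D, FT_UU, FT_UI, FT_CU, FT_CI, FT_DU, FT_DI, FT_VU, FT_INVALID
} ft_state;

typedef enum {
    EV_VISIT, EV_TRUNCATE, EV_PREPARE, EV_ABORT, EV_COMMIT,
    EV_PARENT_CKPT, EV_GLOBALLY_VISIBLE
} ft_event;

/* Return the next state per the chart, or FT_INVALID if the pair is not modeled here. */
static ft_state
ft_next_state(ft_state s, ft_event e)
{
    switch (s) {
    case FT_D: /* An on-disk page that can be truncated. */
        if (e == EV_VISIT)
            return (FT_M);
        if (e == EV_TRUNCATE)
            return (FT_UU);
        break;
    case FT_UU: /* Uncommitted, uninstantiated truncate. */
        switch (e) {
        case EV_VISIT:
            return (FT_UI);
        case EV_PARENT_CKPT: /* Truncation not visible to the checkpoint. */
        case EV_PREPARE:
            return (FT_UU);
        case EV_ABORT:
            return (FT_D);
        case EV_COMMIT:
            return (FT_CU);
        default:
            break;
        }
        break;
    case FT_CU: /* Committed, uninstantiated truncate. */
        switch (e) {
        case EV_VISIT:
            return (FT_CI);
        case EV_PARENT_CKPT: /* A WT_CELL_DEL cell is written. */
            return (FT_DU);
        case EV_GLOBALLY_VISIBLE:
            return (FT_VU);
        default:
            break;
        }
        break;
    case FT_DU: /* Committed on disk, uninstantiated truncate. */
        switch (e) {
        case EV_VISIT:
            return (FT_DI);
        case EV_PARENT_CKPT:
            return (FT_DU);
        case EV_GLOBALLY_VISIBLE:
            return (FT_VU);
        default:
            break;
        }
        break;
    default:
        break;
    }
    return (FT_INVALID);
}

/* Apply a sequence of events, stopping early if a transition is not modeled. */
static ft_state
ft_walk(ft_state s, const ft_event *events, size_t n)
{
    size_t i;

    assert(events != NULL || n == 0);
    for (i = 0; i < n && s != FT_INVALID; i++)
        s = ft_next_state(s, events[i]);
    return (s);
}
```

   For example, starting from FT_D and applying EV_TRUNCATE, EV_COMMIT,
   EV_PARENT_CKPT and EV_GLOBALLY_VISIBLE in that order walks through
   UU, CU and DU and ends in VU, matching the chart.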