Deleted page block is leaked on-disk in disagg when we skip reading it


    • Storage Engines - Transactions
    • SE Transactions - 2026-01-16
    • 5

          /*
           * If the page is deleted and the deletion is globally visible, don't bother reading and
           * explicitly instantiating the existing page. Get a fresh page and pretend we got it by reading
           * the on-disk page. Note that it's important to set the instantiated flag on the page so that
           * reconciling the parent internal page knows it was previously deleted. Otherwise it's possible
           * to write out a reference to the original page without the deletion, which will cause it to
           * come back to life unexpectedly.
           *
           * Setting the instantiated flag requires a modify structure. We don't need to mark it dirty; if
           * it gets discarded before something else modifies it, eviction will see the instantiated flag
           * and set the ref state back to WT_REF_DELETED.
           *
           * Skip this optimization in cases that need the obsolete values. To minimize the number of
           * special cases, use the same test as for skipping instantiation below.
           */
          if (previous_state == WT_REF_DELETED && !F_ISSET(btree, WT_BTREE_SALVAGE | WT_BTREE_VERIFY)) {
              /*
               * If the deletion has not yet been found to be globally visible (page_del isn't NULL),
               * check if it is now, in case we can in fact avoid reading the page. Hide prepared deletes
               * from this check; if the deletion is prepared we still need to load the page, because the
               * reader might be reading at a timestamp early enough to not conflict with the prepare.
               * Update oldest before checking; we're about to read from disk so it's worth doing some
               * work to avoid that.
               */
              WT_ERR(__wt_txn_update_oldest(session, WT_TXN_OLDEST_STRICT | WT_TXN_OLDEST_WAIT));
              if (ref->page_del != NULL && __wt_page_del_visible_all(session, ref->page_del, true))
                  __wt_overwrite_and_free(session, ref->page_del);
      
              if (ref->page_del == NULL) {
                  WT_ERR(__wti_btree_new_leaf_page(session, ref));
                  WT_ERR(__wt_page_modify_init(session, ref->page));
                  ref->page->modify->instantiated = true;
                  goto skip_read;
              }
          }
      

      This optimization skips reading a deleted page when the deletion is globally visible and creates an empty in-memory page instead. However, in disagg this leaks the block on disk, because the page ID of the deleted page is lost once the fresh page takes its place.
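
      The leak can be illustrated with a toy model. This is only a sketch, not WiredTiger code: `toy_ref`, `TOY_NO_PAGE_ID`, `pending_free_id`, and the `toy_*` functions are all hypothetical names invented for illustration, and the "tracked" variant is one possible direction rather than the fix chosen for this ticket. It assumes that freeing an on-disk block requires knowing its page ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NO_PAGE_ID 0 /* hypothetical sentinel: no on-disk block known */

/* Toy stand-in for a ref. This is NOT the WiredTiger WT_REF layout. */
struct toy_ref {
    uint64_t page_id;         /* on-disk block this ref currently points at */
    uint64_t pending_free_id; /* hypothetical: old block ID kept for a later free */
    bool instantiated;        /* mirrors the idea of the instantiated flag */
};

/*
 * Model of the skip-read fast path as described in the ticket: a fresh empty
 * page replaces the deleted one, and the old page ID is simply overwritten.
 */
static void
toy_skip_read_leaky(struct toy_ref *ref)
{
    ref->page_id = TOY_NO_PAGE_ID;
    ref->instantiated = true;
}

/*
 * Hypothetical alternative: remember the old page ID before replacing the
 * page, so that something can still discard the block later.
 */
static void
toy_skip_read_tracked(struct toy_ref *ref)
{
    ref->pending_free_id = ref->page_id;
    ref->page_id = TOY_NO_PAGE_ID;
    ref->instantiated = true;
}

/* True if the ref still records which on-disk block has to be freed. */
static bool
toy_can_free_old_block(const struct toy_ref *ref)
{
    return (ref->page_id != TOY_NO_PAGE_ID || ref->pending_free_id != TOY_NO_PAGE_ID);
}
```

      In this model, calling `toy_skip_read_leaky` on a ref with a valid `page_id` leaves `toy_can_free_old_block` returning false, which corresponds to the leaked block; `toy_skip_read_tracked` keeps the ID reachable through `pending_free_id`.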

            Assignee:
            Chenhao Qu
            Reporter:
            Chenhao Qu