Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-9859

Deduplicate the page skip code for deleted pages

    • Type: Icon: Technical Debt Technical Debt
    • Resolution: Unresolved
    • Priority: Icon: Minor - P4 Minor - P4
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None

      There are two functions used for skipping deleted pages during tree walk: wt_btcur_skip_page, used by cursor_next/prev, and wt_delete_page_skip, used implicitly by everything that isn't RTS, including, redundantly, cursor_next/prev.

      They are not quite the same but do mostly the same checks, including locking the ref to check the page-delete information, which isn't free.

      Point 1: it isn't clear if they really need to be different. The most significant difference is that wt_btcur_skip_page also checks the time aggregate to allow skipping pages that are not deleted as such but have no visible content anyway; maybe there's a reason that's not suitable for other tree walks (besides RTS) but if so it should be discovered and documented.

      Point 2: the common checks should be deduplicated. In addition to not checking the page-delete information twice, we should really only lock the ref once. This suggests that if the time aggregate check is not suitable for other tree walks it should probably become a flag rather than a custom skip function.

      Note 1: wt_btcur_skip_page unconditionally returns false on FLCS trees, because on FLCS iteration has to step through otherwise nonexistent areas of its key space returning zero. This is not necessary for the deleted-page checks; FLCS doesn't support fast truncate because of this iteration issue, so deleted pages don't appear at all anyway. However, it is almost certainly needed to shortcut the time aggregate check. While in the abstract FLCS pages should contain zeros rather than nonzero values with stop times, a nonzero value with a stop time is considerably cheaper to store than a zero value combined with a history store entry for the nonzero value. Therefore, stop times do appear and need to be accounted for in the rest of the system. Consequently, if the time aggregate check gets moved to wt_delete_page_skip the VLCS check needs to go with it.

      Note 2: wt_btcur_skip_page returns false unconditionally for all internal pages. This applies only to the time aggregate check; the time aggregate isn't necessarily up to date even if the page isn't marked dirty. Therefore, if the time aggregate check gets moved to wt_delete_page_skip some form of this logic needs to go with it.

      Note 3: while internal pages are never fast-truncated, they can be deleted, e.g. if they reconcile empty. In these cases ref->page_del will always be null, that is, the deletion is always globally visible, and they can always be skipped.

      Note 4: currently wt_btcur_skip_page does not handle the visible_all case, because I guess it's never called from a visible_all context; be aware of that if transplanting parts of it.

      Note 5: wt_btcur_skip_page contains a comment about locking the ref to access the time aggregate, but the time aggregate is accessed via wt_ref_addr_copy, which is AFAIK supposed to be safe without locking the ref. Locking the ref is, however, required to access ref->page_del.

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            dholland+wt@sauclovia.org David Holland
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: