[Join Optimization] Use page sampling to estimate average number of docs/page and better estimate for numPages in collection

XMLWordPrintableJSON

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Optimization
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      We've identified that our current estimates for the number of pages in a collection can be quite wrong.

       

      sf   collName  costModelEstimatedPages  btreeLeafPages  pctError
      ---  --------  -----------------------  --------------  --------
      0.1  customer  53.8                     38              41.6%
      0.1  lineitem  2435.5                   4140            -41.2%
      0.1  nation    0.6                      1               -40.0%
      0.1  orders    430.5                    624             -31.0%
      0.1  part      53.8                     45              19.6%
      0.1  partsupp  215.3                    355             -39.4%
      0.1  region    0.6                      1               -40.0%
      0.1  supplier  4                        3               33.3%
      1    customer  608.9                    757             -19.6%
      1    lineitem  19484                    20187           -3.5%
      1    nation    0.6                      1               -40.0%
      1    orders    3444.3                   3938            -12.5%
      1    part      608.9                    1092            -44.2%
      1    partsupp  2435.5                   3660            -33.5%
      1    region    0.6                      1               -40.0%
      1    supplier  38.1                     22              73.2% 

      This is because we currently estimate the number of pages by assuming that WT will create pages of 32K bytes post compression (leaf_page_max). Then we get the number of bytes used in the wt file and divide by 32K to get an estimate of the number of leaf pages. In reality, the leaf pages do not end up being close to 32K (we need to better understand why this is).

       

      An alternative idea is to use sampling to estimate the average number of documents per page and then using the document count to estimate the number of pages. I prototyped this approach here.

      This ticket tracks the work to productionize this solution, which will involve negotiating with the WT team about exposing an API that allows for the cursor to know how many documents live on the current page.

            Assignee:
            Unassigned
            Reporter:
            Ben Shteinfeld
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: