[Join Optimization] Costing of INLJ ignores cost of sorted-sparse I/O to fetch records from collection

XMLWordPrintableJSON

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Optimization
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      In SERVER-121256, we lowered the random I/O estimate of INLJ to be equal to the number of probes that will be performed (before the random I/O estimate was the number of documents returned). This is important because both collections and index entries in MongoDB are clustered by RID, meaning that when fetching all records for a single index key, the query engine fetches the records in RID order, which results in a "sorted and sparse" access pattern instead of a truly random IO access pattern.

      However, the join cost model currently ignores the cost of this sorted-sparse IO. In the PR, we found a regression in Q7 of TPC-H (note this version excludes nation2 from the join graph) where we started choosing an INLJ for the supplier - lineitem join instead of a HJ. From the execution stats, we found that we were reading the entirety of the lineitem collection in the INLJ, which is clearly worse than performing a HJ. The LHS of the INLJ is HJ(nation, supplier) which returns 1000 documents, so our cost model estimates ~1000 random I/Os, but in reality this probe returns all 600k lineitems.

      After incorporating this into the cost model, we should expect that for Q7 we will choose HJ(HJ(N, S), L) instead of INLJ(HJ(N, S), L).

            Assignee:
            Ben Shteinfeld
            Reporter:
            Ben Shteinfeld
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: