[Join Optimization] Index scan costing - don't consider each fetched document a random IO


    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization
    • Fully Compatible

      Similar to SERVER-121256, the join cost model currently overestimates the number of random IOs that single-table index scans perform. We currently invoke the Mackert-Lohman formula with docsOutput. This ignores the fact that, for a single index key, all index entries are clustered by RID, so fetching them performs sorted-sparse IO on the collection rather than one random IO per document. We need to instead estimate the NDV of index keys for the scan. This can be done by estimating the NDV of the index keys and scaling it by the selectivity of the range.
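
To illustrate why the input matters, here is a minimal sketch of the Mackert-Lohman page-fetch approximation in the form used by PostgreSQL's planner. It is not the server's implementation: the function name, parameter names, and the example numbers are all hypothetical.

```python
def mackert_lohman_pages(items_fetched: float,
                         table_pages: float,
                         buffer_pages: float) -> float:
    """Approximate page fetches for `items_fetched` random lookups.

    Hypothetical helper: T = pages in the collection, b = buffer pages,
    Ns = number of lookups treated as random IOs.
    """
    T = max(table_pages, 1.0)
    b = max(buffer_pages, 1.0)
    Ns = max(items_fetched, 0.0)

    if T <= b:
        # Everything fits in the buffer: never more than T fetches.
        return min(2.0 * T * Ns / (2.0 * T + Ns), T)

    limit = 2.0 * T * b / (2.0 * T - b)
    if Ns <= limit:
        return 2.0 * T * Ns / (2.0 * T + Ns)
    # Past the limit, each extra lookup misses with probability (T - b) / T.
    return b + (Ns - limit) * (T - b) / T


# Assumed numbers: 10,000 documents output, but only 100 distinct index keys.
pages_with_docs_output = mackert_lohman_pages(10_000, 1_000, 10_000)
pages_with_distinct_keys = mackert_lohman_pages(100, 1_000, 10_000)
```

With these assumed inputs, feeding docsOutput saturates the estimate at the full page count of the collection, while feeding the number of distinct keys yields a far smaller count of random IOs.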

      There are two key technical challenges here:

      1. Arbitrary index scans may contain multikey fields. Our current NDV estimator assumes all fields are non-multikey. I think that SERVER-122379 should address this challenge.
      2. Neither JoinCostEstimator nor JoinCardinalityEstimator has access to the single-table QSNs, which we'll need in order to estimate the NDV of the index keys. This may require some refactoring.

      The other thing this ticket should do is invoke the Yao formula to get the number of distinct pages the fetch will read.
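
For reference, a minimal sketch of Yao's formula, which gives the expected number of distinct pages touched when k records are picked without replacement from n records spread evenly over m pages. The function and parameter names are hypothetical, and the even-spread assumption comes from the formula itself, not from this ticket.

```python
def yao_distinct_pages(total_records: int,
                       total_pages: int,
                       records_selected: int) -> float:
    """Expected distinct pages read: m * (1 - C(n - p, k) / C(n, k)), p = n / m."""
    if total_records <= 0 or total_pages <= 0 or records_selected <= 0:
        return 0.0

    n, m = total_records, total_pages
    k = min(records_selected, n)
    per_page = n / m

    # Selecting more records than fit outside any one page touches every page.
    if k > n - per_page:
        return float(m)

    # Probability that a given page contributes none of the k selected records,
    # computed as a running product to avoid huge binomial coefficients.
    prob_page_untouched = 1.0
    for i in range(k):
        prob_page_untouched *= (n - per_page - i) / (n - i)

    return m * (1.0 - prob_page_untouched)
```

As a sanity check, with 4 records on 2 pages, selecting 2 records touches a single page with probability 1/3 and both pages otherwise, so the expected count is 5/3, which matches the formula.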

      After this ticket, we may require a ticket similar to SERVER-122265 which accounts for the sorted-sparse IO for single table index scans.

            Assignee:
            Ben Shteinfeld
            Reporter:
            Ben Shteinfeld
