Core Server / SERVER-20616

Plan ranker sampling from the beginning of a query's execution can result in poor plan selection

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 2.6.11, 3.0.6, 3.2.0
    • Fix Version/s: Backlog
    • Component/s: Querying
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible

      Description

      The query engine currently selects a winning plan among n candidate plans by the following mechanism. We start executing each of the n plans for some short trial period. Once the trial period ends, we use stats collected from the brief trial execution to score the plans. The most highly scored plan becomes the winner. The scoring is based on a notion of "productivity", essentially the ratio of results produced per amount of work (where work is an artificial unit that is a proxy for CPU + IO time).
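
      A minimal sketch of the scoring step described above, assuming the score is nothing more than results divided by work. The names TrialStats, productivity and pick_winner are hypothetical and do not come from the server code, and the real ranker may weigh additional factors:

```python
from dataclasses import dataclass


@dataclass
class TrialStats:
    """Counters gathered for one candidate plan during its trial run (hypothetical)."""
    name: str
    results_produced: int  # documents returned during the trial period
    work_units: int        # artificial work units consumed during the trial period


def productivity(stats: TrialStats) -> float:
    """Results produced per unit of work; 0.0 if the plan did no work."""
    if stats.work_units == 0:
        return 0.0
    return stats.results_produced / stats.work_units


def pick_winner(candidates: list[TrialStats]) -> TrialStats:
    """Return the candidate with the highest productivity score."""
    return max(candidates, key=productivity)


# Illustrative numbers only, not measurements from a real deployment.
candidates = [
    TrialStats("p1", results_produced=20, work_units=100),
    TrialStats("p2", results_produced=90, work_units=100),
]
winner = pick_winner(candidates)
print(winner.name)  # prints "p2"
```

      The point of the sketch is that the winner is decided only by what each plan did during its trial, which is why the order in which a plan encounters data matters.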

      You can think of the trial execution period as a way to sample the data on the fly, a kind of adaptive query processing technique in which inferences about the data are made while the query is executing. The problem is that this sampling strategy may not accurately represent the true data distribution.

      For example, suppose that some candidate plan p1 is scanning the index {a: 1} over the interval [0, 100000] in order to answer the query predicate {a: {$gte: 0, $lte: 100000}, b: "selectiveString"}. There is an alternative plan p2 which simply looks up all docs where "b" is equal to some rare string, "selectiveString", in the {b: 1} index. Clearly p2 is the better plan: instead of scanning some large index range and filtering out all the documents without the correct value for "b", we do a lookup based on the more selective predicate.

      However, depending on the data, the plan ranking algorithm described above might lead us to choose the suboptimal plan p1 instead of p2. Suppose that there is a correlation between the "a" and "b" fields so that all of the documents where "b" is "selectiveString" also have an "a" value of 0. Since p1 is scanning index {a: 1} in order, we encounter all the documents where "a" is 0 and "b" is "selectiveString" first during the trial execution period. This can spuriously make p1 look as good as p2! If instead we were to sample randomly from the data where "a" is on the interval [0, 100000], the query engine would quickly observe that plan p1 is much slower than p2. Real data is often correlated, so despite the fairly contrived example, this can be a problem in practice.
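
      The effect can be reproduced with a small simulation. This is a sketch under stated assumptions, not the server's algorithm: the collection is synthetic, each index entry examined is charged one work unit, and TRIAL_BUDGET is a made-up trial length. It compares p1's score from an in-order trial against p2's trial score, p1's score over the whole scan, and p1's score on a random sample of the same size:

```python
import random

TRIAL_BUDGET = 100  # work units each plan may spend in the trial (assumed value)

# Synthetic, correlated collection: every "selectiveString" document has a == 0.
docs = [{"a": 0, "b": "selectiveString"} for _ in range(200)]
docs += [{"a": a, "b": "other"} for a in range(1, 99_801)]


def matches(doc):
    """The query predicate {a: {$gte: 0, $lte: 100000}, b: "selectiveString"}."""
    return 0 <= doc["a"] <= 100_000 and doc["b"] == "selectiveString"


def productivity(examined):
    """Results per work unit, charging one work unit per entry examined (assumption)."""
    examined = list(examined)
    if not examined:
        return 0.0
    return sum(matches(d) for d in examined) / len(examined)


# p1 walks the {a: 1} index in order; sorted() is stable, so a == 0 entries come first.
index_a = sorted(docs, key=lambda d: d["a"])
# p2 looks up only the entries of the {b: 1} index equal to "selectiveString".
index_b = [d for d in docs if d["b"] == "selectiveString"]

p1_trial = productivity(index_a[:TRIAL_BUDGET])  # in-order trial: sees only matches
p2_trial = productivity(index_b[:TRIAL_BUDGET])
p1_full = productivity(index_a)                  # p1 over its whole index range

rng = random.Random(0)
p1_random = productivity(rng.sample(index_a, TRIAL_BUDGET))  # random sample of the range

print("p1 trial:", p1_trial)
print("p2 trial:", p2_trial)
print("p1 full scan:", p1_full)
print("p1 random sample:", p1_random)
```

      With this data the in-order trial gives p1 and p2 the same score, even though p1's productivity over the full scan is 200 / 100000. The random sample lands far closer to the full-scan figure than the in-order trial does.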

              People

              Assignee:
              backlog-server-query Backlog - Query Team
              Reporter:
              david.storch David Storch
              Votes:
              6
              Watchers:
              34
