Improve NDVs computation speed

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      The current NDV estimation interface assumes that we first collect the sample, and then we count the NDVs for each tuple separately. This is simple and flexible, but it will add significant overhead since each pass over the sample is expensive.

      This ticket is to improve the performance of NDV computation by either: doing all NDV estimates together in a single pass over the sample or, even more efficient, doing this inside sample collection. This will require collecting all field name tuples for NDV computation up front.

      There are several other ideas for performance improvements here:

      • When counting NDV within the sample, try to reduce the number of passes over a single document (we could try to re-use the SBE expression field path work, but it would require the sample to be persisted)
      • Refactor the method-of-moments newton-raphson iteration to do the most expensive computation only once per iteration 

            Assignee:
            Unassigned
            Reporter:
            Hana Pearlman
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: