-
Type:
Improvement
-
Resolution: Unresolved
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
Query Optimization
-
None
-
None
-
None
-
None
-
None
-
None
-
None
The current NDV estimation interface assumes that we first collect the sample, and then we count the NDVs for each tuple separately. This is simple and flexible, but it will add significant overhead since each pass over the sample is expensive.
This ticket is to improve the performance of NDV computation by either: doing all NDV estimates together in a single pass over the sample or, even more efficient, doing this inside sample collection. This will require collecting all field name tuples for NDV computation up front.
There are several other ideas for performance improvements here:
- When counting NDV within the sample, try to reduce the number of passes over a single document (we could try to re-use the SBE expression field path work, but it would require the sample to be persisted)
- Refactor the method-of-moments newton-raphson iteration to do the most expensive computation only once per iteration
- related to
-
SERVER-112233 Generate an appropriate projection for sampling estimators
-
- Open
-
-
SERVER-112337 Use sampling infrastructure to compute NDVs of joining fields
-
- Needs Scheduling
-