-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Query Optimization
-
ALL
If the NDV becomes more than the numberBuckets, the histogram becomes very degenerate. E.g. for numberBuckets = 10 , all the distinct values beyond the first 9 that comprise almost all of the rows in the dataset are put together in the last bucket. This leads to very skewed estimates. The histogram is neither equal-width nor equal-depth and does not seem to minimize the estimation error in any discernible way.
Value N is present in the table N times:
db.foo.drop(); let docs = []; for (let q = 0; q < 1000; q++) { for (let i = 0; i < q; i++) { docs.push({a: q}); } } db.foo.insert(docs); db.foo.runCommand({analyze: "foo", key: "a", numberBuckets: 5});
This produces the following bounds:
bounds: [ 1, 2, 4, 5, 999 ]
All the values ~6 to ~999, present in 498486 documents get a single bucket. The rest of the histogram is occupied by values 1 to 5, present in 15 documents only. So we have the following wildly inaccurate estimates:
Enterprise test> db.foo.find({a:7}).count(); 7 Enterprise test> db.foo.find({a:7}).explain().queryPlanner.winningPlan.cardinalityEstimate; 502
- related to
-
SERVER-99093 histogramCE: Estimate for inequality is zero if NDV > numberBuckets
- Open