Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-99629

histogramCE: Degenerate histogram if NDV > numberBuckets

    • Type: Icon: Bug Bug
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Optimization
    • ALL

      If the NDV becomes more than the numberBuckets, the histogram becomes very degenerate. E.g. for numberBuckets = 10 , all the distinct values beyond the first 9 that comprise almost all of the rows in the dataset are put together in the last bucket. This leads to very skewed estimates. The histogram is neither equal-width nor equal-depth and does not seem to minimize the estimation error in any discernible way.

      Value N is present in the table N times:

      db.foo.drop();
      let docs = [];
      for (let q = 0; q < 1000; q++) {
          for (let i = 0; i < q; i++) {
              docs.push({a: q});
          }
      }
      db.foo.insert(docs);
      db.foo.runCommand({analyze: "foo", key: "a", numberBuckets: 5});

      This produces the following bounds:

      bounds: [ 1, 2, 4, 5, 999 ] 

      All the values ~6 to ~999, present in 498486 documents get a single bucket. The rest of the histogram is occupied by values 1 to 5, present in 15 documents only. So we have the following wildly inaccurate estimates:

      Enterprise test> db.foo.find({a:7}).count();
      7
      Enterprise test> db.foo.find({a:7}).explain().queryPlanner.winningPlan.cardinalityEstimate;
      502 

            Assignee:
            Unassigned Unassigned
            Reporter:
            philip.stoev@mongodb.com Philip Stoev
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated: