Core Server / SERVER-99631

histogramCE: sampleRate < 1 does not use $sample, so no perf improvement

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization

      Specifying a sampleRate < 1 does not result in any noticeable performance improvement. The reason is that the command:

       db.foo.runCommand({analyze: "foo", key: "a", sampleRate: 0.001});

      Results in an internal pipeline that looks like this:

      [
        {"$project":{"val":"$a"}},
        {"$group":{"_id":"a","statistics":{
            "$_internalConstructStats":{
                "val":"$$ROOT",
                "sampleRate":Double(0.001),
                "numberBuckets":100
            }
        }}}
      ]

      That is a COLLSCAN plus $group over the entire collection, so the sampleRate does not reduce the number of documents read.

      Instead, the internal pipeline should look like this:

      [
        {$sample: {size: 0.001 * db.foo.estimatedDocumentCount()}},
        {"$project":{"val":"$a"}},
        {"$group":{"_id":"a","statistics":{
            "$_internalConstructStats":{
                "val":"$$ROOT",
                "sampleRate":Double(0.001),
                "numberBuckets":100
            }
        }}}
      ]

      This produces a plan with MULTI_ITERATOR and sampleFromRandomCursor before the $project and $group, and it is substantially faster.
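
      The proposed pipeline shape can be sketched as a small plain-JavaScript helper. This is only an illustration, not the server's implementation: buildSampledAnalyzePipeline is a hypothetical name, the rounding and clamping of the sample size and the choice to skip $sample when sampleRate is 1 are assumptions, and estimatedCount stands in for the value of db.foo.estimatedDocumentCount(). The helper only builds the stage documents; $_internalConstructStats is an internal accumulator and the result is not meant to be run directly by users.

```javascript
// Hypothetical helper: builds the pipeline shape proposed in this ticket
// as a plain array. It does not talk to a server.
function buildSampledAnalyzePipeline(key, sampleRate, estimatedCount, numberBuckets = 100) {
    // Assumption: sampleRate is a fraction in (0, 1].
    if (!(sampleRate > 0 && sampleRate <= 1)) {
        throw new RangeError("sampleRate must be in (0, 1]");
    }
    const stages = [];
    if (sampleRate < 1) {
        // $sample takes a positive integer size, so round to the nearest
        // integer and clamp to at least 1 (assumed policy).
        const size = Math.max(1, Math.round(sampleRate * estimatedCount));
        stages.push({$sample: {size: size}});
    }
    stages.push({$project: {val: "$" + key}});
    stages.push({
        $group: {
            _id: key,
            statistics: {
                $_internalConstructStats: {
                    val: "$$ROOT",
                    sampleRate: sampleRate,
                    numberBuckets: numberBuckets,
                },
            },
        },
    });
    return stages;
}

// Example: a 0.1% sample of a collection estimated at one million documents.
const pipeline = buildSampledAnalyzePipeline("a", 0.001, 1000000);
console.log(JSON.stringify(pipeline, null, 2));
```

      Placing $sample first is the point of the proposal: the $project and $group stages then only see the sampled documents instead of the full collection.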

            Assignee: Unassigned
            Reporter: Philip Stoev (philip.stoev@mongodb.com)
            Votes: 0
            Watchers: 2

              Created:
              Updated: