Core Server / SERVER-24152

Add $bucketAuto stage


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.11
    • Component/s: Aggregation Framework
    • Backwards Compatibility: Fully Compatible
    • Sprint: Query 17 (07/15/16), Query 18 (08/05/16)
    • Linked BF Score: 0

      Description

      Syntax

      {
        $bucketAuto: {
          groupBy: <arbitrary expression>,
          buckets: <number which is representable as 32-bit integer>,
          output: {  // Optional, defaults to {count: {$sum: 1}}
            fieldName1: <accumulator 1>,
            fieldName2: <accumulator 2>
          },
          granularity: <optional string - granularity spec>
        }
      }
      

      Supported Granularity Specs (based on the wiki page on preferred numbers):

      • "R5"
      • "R10"
      • "R20"
      • "R40"
      • "R80"
      • "1-2-5"
      • "E6"
      • "E12"
      • "E24"
      • "E48"
      • "E96"
      • "E192"
      • "POWERSOF2" (1, 2, 4, 8, …)
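      A rough Python sketch (hypothetical, not the server implementation) of what "round to the closest value in a series" can look like, assuming each series repeats per decade; the R5 and 1-2-5 values below come from the standard preferred-number tables:

```python
import math

# Preferred-number series, one decade's worth of values each.
# (Only two series shown here; the E-series would follow the same pattern.)
SERIES = {
    "R5": [1.0, 1.6, 2.5, 4.0, 6.3],
    "1-2-5": [1.0, 2.0, 5.0],
}

def round_to_series(x, series_name):
    """Round a positive number to the closest member of the named series."""
    base = SERIES[series_name]
    exp = math.floor(math.log10(x))
    # Consider candidates from the decade below, at, and above x.
    candidates = [v * 10.0 ** (exp + d) for d in (-1, 0, 1) for v in base]
    return min(candidates, key=lambda c: abs(c - x))
```

      For instance, round_to_series(54.99, "R5") gives 63.0 and round_to_series(54.99, "1-2-5") gives 50.0, consistent with the boundary rounding in Example 3 below.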

      Examples

      //
      // Example 1 - all defaults.
      //
       
      > db.example.insert([
        {_id: 0, price: 34.4},
        {_id: 1, price: 20},
        {_id: 2, price: 0.45},
        {_id: 3, price: 999.99},
        ...
      ])
      > db.example.aggregate([{
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4
        }
      }])
      // Output - note the weird ranges.
      {_id: {min: 0.01, max: 2.49}, count: 43}  // 0.01 was the lowest price in the catalog.
      {_id: {min: 2.49, max: 10.45}, count: 43}
      {_id: {min: 10.45, max: 54.99}, count: 43}
      {_id: {min: 54.99, max: 5499.99}, count: 43}  // 5499.99 was the highest price.
       
      //
      // Example 2 - custom accumulators.
      //
       
      // Same data.
      > db.example.aggregate([{
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4,
          output: {
            count: {$sum: 1},
            avgPrice: {$avg: "$price"},
          }
        }
      }])
      // Output - still has the weird ranges.
      {_id: {min: 0.01, max: 2.49}, count: 43, avgPrice: 1.55}
      {_id: {min: 2.49, max: 10.45}, count: 43, avgPrice: 5.43}
      {_id: {min: 10.45, max: 54.99}, count: 43, avgPrice: 20.44}
      {_id: {min: 54.99, max: 5499.99}, count: 43, avgPrice: 78.97}
       
       
      //
      // Example 3 - custom granularity.
      //
       
      // Same data.
      > db.example.aggregate([{
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4,
          output: {
            count: {$sum: 1},
            avgPrice: {$avg: "$price"},
          },
          granularity: "R5"
        }
      }])
      // Output
      {_id: {min: 0.01, max: 2.5}, count: 43, avgPrice: 1.55}
      {_id: {min: 2.5, max: 10}, count: 41, avgPrice: 5.46}
      {_id: {min: 10, max: 63}, count: 48, avgPrice: 20.24}
      {_id: {min: 63, max: 5499.99}, count: 41, avgPrice: 79.97}
      

      Behavior

      • This is a blocking stage.
      • The granularity will be used as follows:
        • Once the unrounded boundary values have been chosen, all boundaries except the first and last will be replaced with their closest value from the specified series. For example:
          0.01, 2.49, 10.45, 54.99, 5499.99 above will be replaced with
          0.01, 2.50, 10.00, 63.00, 5499.99 for "R5", or
          0.01, 2.00, 10.00, 50.00, 5499.99 for "1-2-5"
          (extra digits added to preserve alignment).
      • Dynamically computes the bucket boundaries so as to produce the requested number of buckets, each containing approximately the same number of documents. For example, if you ask for 2 buckets, you get buckets covering [0th percentile, 50th percentile) and [50th percentile, 100th percentile].
      • [Edge cases] (This is a draft, more research needs to be done)
        • In certain cases, there will not be as many buckets as requested.
        • In certain cases, the width of each bucket will not be uniform.
        • In certain cases, the 'height' (number of documents placed in each bucket) will not be uniform.
        • Take the following examples:
          • When requesting 2 buckets, with values {1,1,1,1,2}, you will get 2 buckets:

            {min: MinKey, max: 2}  // Includes only 1's - 4 total.
            {min: 2, max: MaxKey}  // Includes only 2's - 1 total.
            

          • When requesting 3 buckets, with values {1,2,2,2,2}, you will get 2 buckets:

            {min: MinKey, max: 2}  // Includes only 1's - 1 total.
            {min: 2, max: MaxKey}  // Includes only 2's - 4 total.
            

          • When requesting 2 buckets, with values {0,1,2,3,4,5,5,5,5} (9 total), you will get 2 buckets:

            {min: MinKey, max: 5}  // 5 total.
            {min: 5, max: MaxKey}  // 4 total.
            

          • When requesting 3 buckets, with values {0,1,2,3,4,5,6,7} (8 total), you will get 3 buckets:

            {min: MinKey, max: 3}  // 3 total.
            {min: 3, max: 6}       // 3 total.
            {min: 6, max: MaxKey}  // 2 total.
            

          • When requesting 3 buckets, with values {0,1,2,2,2,2,2,3} (8 total), you will get 3 buckets:

            {min: MinKey, max: 2}  // 2 total.
            {min: 2, max: 3}       // 5 total.
            {min: 3, max: MaxKey}  // 1 total.
            

          • When requesting 3 buckets, with values {0,1,2,2,2,3,3,3} (8 total), you will get 3 buckets:

            {min: MinKey, max: 2}  // 2 total.
            {min: 2, max: 3}       // 3 total.
            {min: 3, max: MaxKey}  // 3 total.
            

        • Guess at a reasonable implementation: compute the average bucket size (total documents divided by requested buckets), then fill buckets from left to right (smallest 'min' to largest). Values are placed into the current bucket until either the bucket's size reaches the average bucket size, or the next unique value has at least as many copies as the average bucket size, at which point the current bucket is closed.
          More research needs to be done here.
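      As a sanity check of that guess, here is a small Python model (hypothetical; not the server code) of the fill-left-to-right strategy. It only computes the number of documents per bucket:

```python
from itertools import groupby

def bucket_counts(values, n_buckets):
    """Model the guessed strategy: fill buckets left to right, closing the
    current bucket once it reaches the average size, or early when the next
    run of identical values is itself at least the average size."""
    # Lengths of each run of identical values, in sorted order.
    runs = [len(list(g)) for _, g in groupby(sorted(values))]
    avg = len(values) / n_buckets
    buckets, current = [], 0
    for i, run in enumerate(runs):
        current += run  # identical values always land in the same bucket
        next_run = runs[i + 1] if i + 1 < len(runs) else 0
        if current >= avg or next_run >= avg:
            buckets.append(current)
            current = 0
    if current:
        buckets.append(current)
    return buckets
```

      This model reproduces the edge-case examples above: bucket_counts([1,1,1,1,2], 2) gives [4, 1], and bucket_counts([1,2,2,2,2], 3) gives [1, 4] (only two buckets despite three requested).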

              People

              Assignee: sally.mcnichols (Sally McNichols)
              Reporter: charlie.swanson (Charlie Swanson)