Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-24152

Add $bucketAuto stage

    • Fully Compatible
    • Query 17 (07/15/16), Query 18 (08/05/16)
    • 0

      Syntax

      {
        $bucketAuto: {
          groupBy: <arbitrary expression>,
          buckets: <number which is representable as 32-bit integer>,
          output: {  // Optional, defaults to {count: {$sum: 1}}
            fieldName1: <accumulator 1>,
            fieldName2: <accumulator 2>
          },
          granularity: <optional string - granularity spec>
        }
      }
      

      Supported Granularity Specs (based off of the wiki page on preferred numbers):

      • "R5"
      • "R10"
      • "R20"
      • "R40"
      • "R80"
      • "1-2-5"
      • "E6"
      • "E12"
      • "E24"
      • "E48"
      • "E96"
      • "E192"
      • "POWERSOF2" (1, 2, 4, 8, …)

      Examples

      //
      // Example 1 - all defaults.
      //
      
      > db.example.insert([
        {_id: 0, price: 34.4},
        {_id: 1, price: 20},
        {_id: 2, price: 0.45},
        {_id: 3, price: 999.99},
        ...
      ])
      > db.example.aggregate([{
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4
        }
      }])
      // Output - note the weird ranges.
      {_id: {min: 0.01, max: 2.49}, count: 43}  // 0.01 was the lowest price in the catalog.
      {_id: {min: 2.49, max: 10.45}, count: 43}
      {_id: {min: 10.45, max: 54.99}, count: 43}
      {_id: {min: 54.99, max: 5499.99}, count: 43}  // 5499.99 was the highest price.
      
      //
      // Example 2 - custom accumulators.
      //
      
      // Same data.
      > db.example.aggregate([{
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4,
          output: {
            count: {$sum: 1},
            avgPrice: {$avg: "$price"},
          }
        }
      }])
      // Output - still has the weird ranges.
      {_id: {min: 0.01, max: 2.49}, count: 43, avgPrice: 1.55}
      {_id: {min: 2.49, max: 10.45}, count: 43, avgPrice: 5.43}
      {_id: {min: 10.45, max: 54.99}, count: 43, avgPrice: 20.44}
      {_id: {min: 54.99, max: 5499.99}, count: 43, avgPrice: 78.97}
      
      
      //
      // Example 3 - custom granularity.
      //
      
      // Same data.
      > db.example.aggregate([{
        $bucketAuto: {
          groupBy: "$price",
          buckets: 4,
          output: {
            count: {$sum: 1},
            avgPrice: {$avg: "$price"},
          },
          granularity: "R5"
        }
      }])
      // Output
      {_id: {min: 0.01, max: 2.5}, count: 43, avgPrice: 1.55}
      {_id: {min: 2.5, max: 10}, count: 41, avgPrice: 5.46}
      {_id: {min: 10, max: 63}, count: 48, avgPrice: 20.24}
      {_id: {min: 63, max: 5499.99}, count: 41, avgPrice: 79.97}
      

      Behavior

      • This is a blocking stage.
      • The granularity will be used as follows:
        • Once the unrounded boundary values have been chosen, all values except the first and last will be replaced with their closest value from the specified series:
        • Example:
          0.01, 2.49, 10.45, 54.99, 5499.99 above will be replaced with
          0.01, 2.50, 10.00, 63.00, 5499.99 for "R5", or
          0.01, 2.00, 10.00, 50.00, 5499.99 for "1-2-5"
          (extra digits added to conserve alignment).
      • Will dynamically compute buckets to give you the desired number of buckets, each with an approximately equal number of documents in each. For example, if you asked for 2 buckets, you would get buckets representing [0th percentile, 50th percentile), [50th percentile, 100th percentile]
      • [Edge cases] (This is a draft, more research needs to be done)
        • In certain cases, there will not be as many buckets as requested.
        • In certain cases, the width of each bucket will not be uniform.
        • In certain cases, the 'height' (number of documents placed in each bucket) will not be uniform.
        • Take the following examples:
          • When requesting 2 buckets, with values {1,1,1,1,2}, you will get 2 buckets:
            {min: MinKey, max: 2}  // Includes only 1's - 4 total.
            {min: 2, max: MaxKey}  // Includes only 2's - 1 total.
            
          • When requesting 3 buckets, with values {1,2,2,2,2}, you will get 2 buckets:
            {min: MinKey, max: 2}  // Includes only 1's - 1 total.
            {min: 2, max: MaxKey}  // Includes only 2's - 4 total.
            
          • When requesting 2 buckets, with values {0,1,2,3,4,5,5,5,5} (9 total), you will get 2 buckets:
            {min: MinKey, max: 5}  // 5 total.
            {min: 5, max: MaxKey}  // 4 total.
            
          • When requesting 3 buckets, with values {0,1,2,3,4,5,6,7} (8 total), you will get 3 buckets:
            {min: MinKey, max: 3}  // 3 total.
            {min: 3, max: 6}       // 3 total.
            {min: 6, max: MaxKey}  // 2 total.
            
          • When requesting 3 buckets, with values {0,1,2,2,2,2,2,3} (8 total), you will get 3 buckets:
            {min: MinKey, max: 2}  // 2 total.
            {min: 2, max: 3}       // 5 total.
            {min: 3, max: MaxKey}  // 1 total.
            
          • When requesting 3 buckets, with values {0,1,2,2,2,3,3,3} (8 total), you will get 3 buckets:
            {min: MinKey, max: 2}  // 2 total.
            {min: 2, max: 3}       // 3 total.
            {min: 3, max: MaxKey}  // 3 total.
            
        • Guess at a reasonable implementation: The bounds of each bucket are computed by first computing the average bucket size, then filling buckets left to right (smallest 'min' to largest), placing values into the current bucket until the current bucket size is greater than or equal to the average bucket size, or until the next unique value has at least as many copies as the average bucket size.
          More research needs to be done here.

            Assignee:
            sally.mcnichols Sally McNichols
            Reporter:
            charlie.swanson@mongodb.com Charlie Swanson
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: