[SERVER-24152] Add $bucketAuto stage Created: 16/May/16  Updated: 14/Mar/17  Resolved: 05/Aug/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 3.3.11

Type: New Feature Priority: Major - P3
Reporter: Charlie Swanson Assignee: Sally McNichols
Resolution: Done Votes: 0
Labels: stage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Documented
is documented by DOCS-8829 Faceted Search - $bucketAuto Closed
Related
related to DRIVERS-297 Aggregation Framework Support for 3.4 Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 17 (07/15/16), Query 18 (08/05/16)
Participants:
Linked BF Score: 0

 Description   

Syntax

{
  $bucketAuto: {
    groupBy: <arbitrary expression>,
    buckets: <number which is representable as 32-bit integer>,
    output: {  // Optional, defaults to {count: {$sum: 1}}
      fieldName1: <accumulator 1>,
      fieldName2: <accumulator 2>
    },
    granularity: <optional string - granularity spec>
  }
}

Supported Granularity Specs (based on the Wikipedia page on preferred numbers):

  • "R5"
  • "R10"
  • "R20"
  • "R40"
  • "R80"
  • "1-2-5"
  • "E6"
  • "E12"
  • "E24"
  • "E48"
  • "E96"
  • "E192"
  • "POWERSOF2" (1, 2, 4, 8, …)
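
As a rough illustration of the constraints above, the following sketch checks a stage specification on the client before it is sent. checkBucketAutoSpec and SUPPORTED_GRANULARITIES are hypothetical names, not part of the server or any driver. The rule that buckets must be positive is an assumption; this ticket only says the value must be representable as a 32-bit integer.

```javascript
// Hypothetical client-side check, not the server's validation logic.
const SUPPORTED_GRANULARITIES = [
  "R5", "R10", "R20", "R40", "R80", "1-2-5",
  "E6", "E12", "E24", "E48", "E96", "E192", "POWERSOF2",
];

function checkBucketAutoSpec(spec) {
  const problems = [];
  // groupBy is the only field the syntax does not mark as optional besides buckets.
  if (spec.groupBy === undefined) {
    problems.push("groupBy is required");
  }
  const b = spec.buckets;
  const fitsInt32 = Number.isInteger(b) && b >= -(2 ** 31) && b <= 2 ** 31 - 1;
  if (!fitsInt32) {
    problems.push("buckets must be representable as a 32-bit integer");
  } else if (b <= 0) {
    // Assumption: a non-positive bucket count is rejected.
    problems.push("buckets must be positive");
  }
  if (spec.granularity !== undefined &&
      !SUPPORTED_GRANULARITIES.includes(spec.granularity)) {
    problems.push("unsupported granularity: " + spec.granularity);
  }
  return problems;
}

// An empty array means no problems were found.
console.log(checkBucketAutoSpec({groupBy: "$price", buckets: 4, granularity: "R5"}));
```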

Examples

//
// Example 1 - all defaults.
//
 
> db.example.insert([
  {_id: 0, price: 34.4},
  {_id: 1, price: 20},
  {_id: 2, price: 0.45},
  {_id: 3, price: 999.99},
  ...
])
> db.example.aggregate([{
  $bucketAuto: {
    groupBy: "$price",
    buckets: 4
  }
}])
// Output - note the irregular, unrounded bucket boundaries.
{_id: {min: 0.01, max: 2.49}, count: 43}  // 0.01 was the lowest price in the catalog.
{_id: {min: 2.49, max: 10.45}, count: 43}
{_id: {min: 10.45, max: 54.99}, count: 43}
{_id: {min: 54.99, max: 5499.99}, count: 43}  // 5499.99 was the highest price.
 
//
// Example 2 - custom accumulators.
//
 
// Same data.
> db.example.aggregate([{
  $bucketAuto: {
    groupBy: "$price",
    buckets: 4,
    output: {
      count: {$sum: 1},
      avgPrice: {$avg: "$price"},
    }
  }
}])
// Output - still has the irregular, unrounded bucket boundaries.
{_id: {min: 0.01, max: 2.49}, count: 43, avgPrice: 1.55}
{_id: {min: 2.49, max: 10.45}, count: 43, avgPrice: 5.43}
{_id: {min: 10.45, max: 54.99}, count: 43, avgPrice: 20.44}
{_id: {min: 54.99, max: 5499.99}, count: 43, avgPrice: 78.97}
 
 
//
// Example 3 - custom granularity.
//
 
// Same data.
> db.example.aggregate([{
  $bucketAuto: {
    groupBy: "$price",
    buckets: 4,
    output: {
      count: {$sum: 1},
      avgPrice: {$avg: "$price"},
    },
    granularity: "R5"
  }
}])
// Output
{_id: {min: 0.01, max: 2.5}, count: 43, avgPrice: 1.55}
{_id: {min: 2.5, max: 10}, count: 41, avgPrice: 5.46}
{_id: {min: 10, max: 63}, count: 48, avgPrice: 20.24}
{_id: {min: 63, max: 5499.99}, count: 41, avgPrice: 79.97}
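
The interior boundaries in Example 3 are consistent with replacing each unrounded boundary from Example 1 with its closest R5 value. Below is a minimal sketch of that rounding, based only on the description in this ticket rather than on the shipped code. roundToSeries, roundBoundaries and the SERIES table are hypothetical names, only two of the supported series are included, and non-positive inputs are rejected because the series are not well defined for them.

```javascript
// Base values of each series within one decade (hypothetical table for this sketch).
const SERIES = {
  "R5": [1.0, 1.6, 2.5, 4.0, 6.3],
  "1-2-5": [1, 2, 5],
};

// Returns the series value closest to x, scaling the base values by powers of ten.
function roundToSeries(x, seriesName) {
  const base = SERIES[seriesName];
  if (!base) throw new Error("series not covered by this sketch: " + seriesName);
  if (!(x > 0)) throw new Error("only positive numbers are handled here");
  const exp = Math.floor(Math.log10(x));
  let best = null;
  // Check the neighbouring decades too, so values near a power of ten are handled.
  for (const e of [exp - 1, exp, exp + 1]) {
    for (const m of base) {
      const candidate = m * Math.pow(10, e);
      if (best === null || Math.abs(candidate - x) < Math.abs(best - x)) {
        best = candidate;
      }
    }
  }
  return best;
}

// Rounds every boundary except the first and last, as the ticket describes.
function roundBoundaries(boundaries, seriesName) {
  return boundaries.map((b, i) =>
    i === 0 || i === boundaries.length - 1 ? b : roundToSeries(b, seriesName));
}

// Interior boundaries 2.49, 10.45, 54.99 move to roughly 2.5, 10, 63 for "R5".
console.log(roundBoundaries([0.01, 2.49, 10.45, 54.99, 5499.99], "R5"));
```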

Behavior

  • This is a blocking stage.
  • The granularity will be used as follows:
    • Once the unrounded boundary values have been chosen, all values except the first and last will be replaced with their closest value from the specified series.
    • Example:
      0.01, 2.49, 10.45, 54.99, 5499.99 above will be replaced with
      0.01, 2.50, 10.00, 63.00, 5499.99 for "R5", or
      0.01, 2.00, 10.00, 50.00, 5499.99 for "1-2-5"
      (extra digits added to preserve alignment).
  • The stage dynamically computes bucket boundaries to produce the requested number of buckets, each containing approximately the same number of documents. For example, if you ask for 2 buckets, you get buckets representing [0th percentile, 50th percentile) and [50th percentile, 100th percentile].
  • [Edge cases] (This is a draft; more research needs to be done.)
    • In certain cases, there will not be as many buckets as requested.
    • In certain cases, the width of each bucket will not be uniform.
    • In certain cases, the 'height' (number of documents placed in each bucket) will not be uniform.
    • Take the following examples:
      • When requesting 2 buckets, with values {1,1,1,1,2}, you will get 2 buckets:

        {min: MinKey, max: 2}  // Includes only 1's - 4 total.
        {min: 2, max: MaxKey}  // Includes only 2's - 1 total.
        

      • When requesting 3 buckets, with values {1,2,2,2,2}, you will get 2 buckets:

        {min: MinKey, max: 2}  // Includes only 1's - 1 total.
        {min: 2, max: MaxKey}  // Includes only 2's - 4 total.
        

      • When requesting 2 buckets, with values {0,1,2,3,4,5,5,5,5} (9 total), you will get 2 buckets:

        {min: MinKey, max: 5}  // 5 total.
        {min: 5, max: MaxKey}  // 4 total.
        

      • When requesting 3 buckets, with values {0,1,2,3,4,5,6,7} (8 total), you will get 3 buckets:

        {min: MinKey, max: 3}  // 3 total.
        {min: 3, max: 6}       // 3 total.
        {min: 6, max: MaxKey}  // 2 total.
        

      • When requesting 3 buckets, with values {0,1,2,2,2,2,2,3} (8 total), you will get 3 buckets:

        {min: MinKey, max: 2}  // 2 total.
        {min: 2, max: 3}       // 5 total.
        {min: 3, max: MaxKey}  // 1 total.
        

      • When requesting 3 buckets, with values {0,1,2,2,2,3,3,3} (8 total), you will get 3 buckets:

        {min: MinKey, max: 2}  // 2 total.
        {min: 2, max: 3}       // 3 total.
        {min: 3, max: MaxKey}  // 3 total.
        

    • Guess at a reasonable implementation: Compute the average bucket size (the number of documents divided by the number of requested buckets), then fill buckets left to right (smallest 'min' to largest). Keep placing values into the current bucket until its size is greater than or equal to the average bucket size, or until the next unique value has at least as many copies as the average bucket size.
      More research needs to be done here.
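
The guessed implementation above can be sketched as follows. This is an illustration of the draft rule only, not the code that was eventually committed. sketchBucketAuto is a hypothetical name, the values are assumed to be plain numbers, and the outer bounds are shown as the strings "MinKey" and "MaxKey" to mirror the edge-case listings. One further assumption is added: once the last requested bucket is reached, it absorbs all remaining values so the result never has more buckets than requested.

```javascript
// Hypothetical sketch of the draft bucketing rule; numeric inputs only.
function sketchBucketAuto(values, numBuckets) {
  const sorted = [...values].sort((a, b) => a - b);

  // Collapse the sorted values into runs of equal values with their copy counts.
  const runs = [];
  for (const v of sorted) {
    const last = runs[runs.length - 1];
    if (last && last.value === v) {
      last.copies += 1;
    } else {
      runs.push({value: v, copies: 1});
    }
  }

  const avg = sorted.length / numBuckets;  // average bucket size
  const buckets = [];
  let current = null;
  for (const run of runs) {
    // Assumption: the last requested bucket is never closed early.
    const inLastBucket = buckets.length === numBuckets - 1;
    const shouldClose = current !== null && !inLastBucket &&
        (current.count >= avg || run.copies >= avg);
    if (shouldClose) {
      buckets.push(current);
      current = null;
    }
    if (current === null) {
      current = {lower: run.value, count: 0};
    }
    current.count += run.copies;
  }
  if (current !== null) buckets.push(current);

  // Each bucket's max is the next bucket's lower bound.
  return buckets.map((b, i) => ({
    min: i === 0 ? "MinKey" : b.lower,
    max: i === buckets.length - 1 ? "MaxKey" : buckets[i + 1].lower,
    count: b.count,
  }));
}

// Reproduces the {0,1,2,2,2,2,2,3} example: counts 2, 5 and 1.
console.log(sketchBucketAuto([0, 1, 2, 2, 2, 2, 2, 3], 3));
```

Run against the six edge cases listed above, this sketch yields the same boundaries and counts, which suggests the draft rule is at least self-consistent with those examples.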


 Comments   
Comment by Githook User [ 08/Aug/16 ]

Author: Max Hirschhorn (visemet, max.hirschhorn@mongodb.com)

Message: SERVER-24152 Change error code to avoid duplicates.

Fixes compile with the RocksDB storage engine. The error code 40264 was
being used in rocks_index.cpp:

mongodb-partners/mongo-rocks@35c85d8c6d32c0eda01834493069b06f39803a67
Branch: master
https://github.com/mongodb/mongo/commit/0aa6850420e3617bd182c91f4a81c5021f59ee52

Comment by Githook User [ 05/Aug/16 ]

Author: Max Hirschhorn (visemet, max.hirschhorn@mongodb.com)

Message: SERVER-24152 Define HAVE_LOG2 on Windows.

This is a temporary workaround for a stale js-confdefs.h now that we've
switched from VS2013 to VS2015.
Branch: master
https://github.com/mongodb/mongo/commit/30ca3621ae5ab06eb7a4f26a1db21f3d8c624fb1

Comment by Githook User [ 05/Aug/16 ]

Author: Sally McNichols (smcnichols, sally.mcnichols@mongodb.com)

Message: SERVER-24152 add granularity option to $bucketAuto
Branch: master
https://github.com/mongodb/mongo/commit/91e499934c04443d98f88f850a37f9e341382b4b

Comment by Githook User [ 27/Jul/16 ]

Author: Max Hirschhorn (visemet, max.hirschhorn@mongodb.com)

Message: SERVER-24152 Fix string comparison in SourceNameIsBucketAuto test case.

The compiler is able to deduplicate multiple copies of the same string
literal and cause them to refer to the same address. This can
potentially mask a failure when comparing their addresses instead of
their contents.
Branch: master
https://github.com/mongodb/mongo/commit/c15427b5d4b406019109d911f9691da078aeeed6

Comment by Githook User [ 26/Jul/16 ]

Author: Sally McNichols (smcnichols, sally.mcnichols@mongodb.com)

Message: SERVER-24152 add $bucketAuto aggregation stage
Branch: master
https://github.com/mongodb/mongo/commit/39d63ea21d7236a88616e89cb8381b34414ac349

Comment by Sally McNichols [ 26/Jul/16 ]

Hi osmar.olivo,

I was referring to specifically disallowing the use of the granularity option when there are negative numbers or zero. Is that okay?

Comment by Osmar Olivo [ 26/Jul/16 ]

Do you mean specifically disallow the use of the granularity option when there are negative numbers? Or disallow the use of negative numbers with $bucketAuto altogether?

I'm OK with the first one (at least for a v1), but not the second one.

Comment by Charlie Swanson [ 25/Jul/16 ]

osmar.olivo, can you comment on the above? Is it okay to disallow negatives (at least for now)?

Comment by Sally McNichols [ 25/Jul/16 ]

Does the granularity option support rounding negative numbers (and zero)? Some of the preferred number series (like Renard numbers, 1-2-5 series, or E series) aren't well defined for rounding nonpositive numbers.
