Core Server / SERVER-4929

aggregation: $group should have a $median accumulator

    Details

      Description

      This has been suggested a couple of times through forums, meetups, and office hours. $median would go alongside $avg, $min, $max, etc., as a $group accumulator expression.


          Activity

          cwestin Chris Westin added a comment -

          Moved to "Planning Bucket A" to match the disposition of the duplicate SERVER-4571.

          virendra.agarwal@timesinternet.in Virendra Agarwal added a comment -

          It would be good if $median came with a percentile option for the data.

          charlie.swanson Charlie Swanson added a comment -

          It's subtle, but $percentile is actually very different from just another accumulator. Computing the Xth percentile requires looking at all the inputs (and sorting them?) before knowing the answer. This makes it just as memory-intensive as $push. It would also introduce some weird behavior in a sharded cluster. Since a $group would be split to run half on the shards, and half on the merging shard, the shards would have to send over their accumulated data. Everything goes over the wire as BSON, so the intermediate accumulated data could be at most 16MB. It would be surprising for some users to hit an excessive memory error on something that produces a pretty small output.

          As far as I can tell, $median is just a special case of $percentile, computing the 50th percentile.
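          To make that equivalence concrete, here is a minimal Python sketch (not server code; it uses the nearest-rank definition, which is only one of several percentile conventions): computing any percentile means materializing and sorting every input, and $median would simply be the 50th percentile.

          ```python
          import math

          def percentile(values, p):
              """Nearest-rank percentile: sort all inputs, then index.

              Like the proposed $percentile, this has to hold every value
              in memory before it can answer -- the same footprint as $push.
              """
              ordered = sorted(values)
              # Nearest-rank method: ceil(p/100 * n) gives a 1-based rank.
              rank = max(1, math.ceil(p / 100 * len(ordered)))
              return ordered[rank - 1]

          def median(values):
              # $median would just be the 50th percentile.
              return percentile(values, 50)
          ```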

          There may be an easy answer, which is just to introduce a $percentile expression, which operates over an array. Then if you wanted to compute the median, you could do the following:

          db.example.aggregate([
            {$group: {_id: "$group_id", valuesOfInterest: {$push: "$valueOfInterest"}}},
            {$project: {_id: 1, median: {$percentile: ["$valuesOfInterest", 50]}}}
          ])
          

          This would make it explicit that you have to store all the data to compute the percentile. The downside is that there's probably a way to compute percentile as an accumulator that would use a smaller memory footprint (e.g. if there are a lot of duplicate values) that this would miss out on.
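          One way to picture that smaller-footprint accumulator (purely a hypothetical sketch, not an actual server design) is to keep (value, count) pairs instead of every raw input, so heavy duplication shrinks memory, and to merge shard-local maps before answering:

          ```python
          import math
          from collections import Counter

          class PercentileAccumulator:
              """Hypothetical accumulator storing (value, count) pairs
              rather than every input, so duplicate values cost O(1) extra."""

              def __init__(self):
                  self.counts = Counter()
                  self.n = 0

              def accumulate(self, value):
                  self.counts[value] += 1
                  self.n += 1

              def merge(self, other):
                  # Shards would ship their counts to the merging shard.
                  self.counts.update(other.counts)
                  self.n += other.n

              def result(self, p):
                  # Nearest-rank percentile, walking values in sorted order.
                  rank = max(1, math.ceil(p / 100 * self.n))
                  seen = 0
                  for value in sorted(self.counts):
                      seen += self.counts[value]
                      if seen >= rank:
                          return value
          ```

          With many duplicates the map stays far smaller than the $push array, though worst case (all-distinct values) it degrades to the same footprint.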

          Mathias Stearn, do you have any thoughts?


            People

              Votes: 15
              Watchers: 21

              Dates

              Days since reply: 40 weeks, 5 days