[SERVER-4929] aggregation: $group should have a $median accumulator Created: 10/Feb/12  Updated: 08/Nov/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Daniel Pasette (Inactive) Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 31
Labels: accumulator, expression, pm1457-nominee
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-4571 Create $median aggregation feature Closed
Related
is related to SERVER-447 new aggregation framework Closed
Assigned Teams:
Query Optimization
Participants:

 Description   

This has been suggested a couple of times through forums, meetups, and office hours. $median would go with $avg, $min, $max, etc, as a $group expression.



 Comments   
Comment by Ilan Cohen [ 08/Nov/23 ]

MongoDB Version 7 has a $median accumulator which is supported in $group

backlog-query-optimization

Comment by Matteo Spreafico [ 03/Dec/21 ]

Would be very useful (and will also remove an issue opened almost ten years ago...)

Comment by Mathias Stearn [ 30/Oct/19 ]

charlie.swanson sorry for the delay in replying, I just came across this again today. If we are OK with giving fast but imprecise results (which is somewhat implied if you are using $sample), there are algorithms for on-line streaming computation of approximate percentiles and medians. The one I'm familiar with is  P-squared which uses some dynamically moving markers and counts. There is a boost implementation and we even had one in our codebase way back when, but it was removed since the feature it was built for was never productionized. We would need to modify the algorithm to provide some way to merge results from multiple shards, but IIRC that should be possible. There may also be newer algorithms that are designed with that in mind.

Comment by Massimo Brignoli [ 12/Feb/18 ]

Someone had the chance to have a look at tdigest? Seems quite compatible with Aggregation Framework and sharding...

https://github.com/CamDavidsonPilon/tdigest

Comment by Charlie Swanson [ 23/Feb/16 ]

It's subtle, but $percentile is actually very different from just another accumulator. Computing the Xth percentile requires looking at all the inputs (and sorting them?) before knowing the answer. This makes it just as memory-intensive as $push. It would also introduce some weird behavior in a sharded cluster. Since a $group would be split to run half on the shards, and half on the merging shard, the shards would have to send over their accumulated data. Everything goes over the wire as BSON, so the intermediate accumulated data could be at most 16MB. It would be surprising for some users to hit an excessive memory error on something that produces a pretty small output.

As far as I can tell, $median is just a special case of $percentile, computing the 50th percentile.

There may be an easy answer, which is just to introduce a $percentile expression, which operates over an array. Then if you wanted to compute the median, you could do the following:

db.example.aggregate([
  {$group: {_id: "$group_id", valuesOfInterest: {$push: "$valueOfInterest"}}},
  {$project: {_id: 1, median: {$percentile: ["$valuesOfInterest", 50]}}}
])

This would make it explicit that you have to store all the data to compute the percentile. The downside is that there's probably a way to compute percentile as an accumulator that would use a smaller memory footprint (e.g. if there are a lot of duplicate values) that this would miss out on.

redbeard0531, do you have any thoughts?

Comment by Virendra Agarwal [ 16/Sep/15 ]

It will be good if Median comes with percentile option for Data.

Comment by Chris Westin [ 16/Mar/12 ]

Moved to "Planning Bucket A" to match the disposition of the duplicate SERVER-4571.

Generated at Thu Feb 08 03:07:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.