[SERVER-9377] Allow collecting "top" N values for each group Created: 17/Apr/13  Updated: 16/Sep/22  Resolved: 06/Jan/22

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 5.2.0

Type: Improvement Priority: Major - P3
Reporter: Asya Kamsky Assignee: Katya Kamenieva
Resolution: Done Votes: 77
Labels: accumulator, expression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-10205 Group and select the top K elements i... Closed
is duplicated by SERVER-16989 Add $group accumulation operators $fi... Closed
is duplicated by SERVER-43928 Allow group aggregations to limit col... Closed
is duplicated by SERVER-7618 New aggregation expression: generator... Closed
Related
Backwards Compatibility: Fully Compatible
Sprint: QE 2021-10-18, QE 2021-11-29, QE 2021-12-13
Participants:
Case:

 Description   
Issue Status as of Jan 6, 2022

Summary

  • Accumulators that return an array of top/bottom n elements for each group according to the specified order

    {
       $topN: {
           n: <n expression>,
           sortBy: <sort spec>,
           output: <expression>
       }
    }
     
    {
       $bottomN: {
           n: <n expression>,
           sortBy: <sort spec>,
           output: <expression>
       }
    }
    

    To return one top/bottom element:

    {
       $top: {
           sortBy: <sort spec>,
           output: <expression>
       }
    }
     
    {
       $bottom: {
           sortBy: <sort spec>,
           output: <expression>
       }
    }
    

  • Accumulators that return first/last n elements for each group according to the current order of the documents:

    {
       $firstN: {
           n: <n expression>,
           input: <expression>
       }
    }
     
    {
       $lastN: {
           n: <n expression>,
           input: <expression>
       }
    }
    

  • Accumulators that return n highest/lowest values for each group

    {
       $maxN: {
           n: <n expression>,
           input: <expression>
       }
    }
     
    {
       $minN: {
           n: <n expression>,
           input: <expression>
       }
    }
    

  • $firstN/$lastN/$minN/$maxN are also available as aggregation expressions for array fields.

Examples

db.scores.insert([
 {user: "user1", game:"game_A", score: 2345},
 {user: "user2", game:"game_A", score: 100},
 {user: "user3", game:"game_A", score: 555},
 {user: "user4", game:"game_A", score: 53234},
 {user: "user1", game:"game_B", score: 65438},
 {user: "user2", game:"game_B", score: 865},
 {user: "user3", game:"game_B", score: 400},
 {user: "user4", game:"game_B", score: 9865}
]);
 
db.scores.aggregate([
  {
     $group:
     {
        _id: "$game",
        leaderboard:
           {
              $topN:
                 {
                    sortBy: {score: -1},
                    n: 3,
                    output: {userName: '$user', score: '$score'}
                 }
           }
     }
  }
])

Output

[{
 user: "game_A",
 leaderboard: [
   {userName: "user4", score: 53234},
   {userName: "user1", score: 2345},
   {userName: "user3", score: 555},
 ]
},
{
 user: "game_B",
 leaderboard: [
   {user: "user1", score: 65438},
   {user: "user4", score: 9865},
   {user: "user2", score: 865},
 ]
}]

db.scores.aggregate( [
  {
     $group:
     {
        _id: "$game",
        threeHighestScores:
           {
              $maxN:
                 {
                    input: "$score",
                    n: 3
                 }
           }
     }
  }
] )

Output:

[{
 _id: "game_A",
 threeHighestScores: [ 53234, 2345, 555]
},
{
 _id: "game_B",
 leaderboard: [
   threeHighestScores: [ 65438, 9865, 865]
}]

Versions
This feature will be available starting in version 5.2.0 once the upgrade (including upgrading the FCV) is fully complete.

Original Description
Title: Extend $push or $max to allow collecting "top" N values per _id key in $group phase

Syntax

{$pushFirstN: [<expression>, N]}
{$pushLastN: [<expression>, N]}
{$pushMinN: [<expression>, N]}
{$pushMaxN: [<expression>, N]}

Examples

> db.coll.insert([
  {_id: "user1", game:"A", score: 2345},
  {_id: "user3", game:"A", score: 100},
  {_id: "user4", game:"A", score: 555},
  {_id: "user7", game:"A", score: 53234}
]);
> db.coll.aggregate([
    {$sort: {score: -1}},
    {$group: {
        _id: "$game",
        leaderBoardUsers: {
            $pushFirstN: ["$_id", 10]
        }
    }
])
{_id: "A", leaderBoardUsers: ["user7", "user1", "user4", "user3"]}
 
 
// Example 2
> db.coll.insert([
  {_id: "user1", game:"A", score: 2345}
  {_id: "user3", game:"A", score: 100}
  {_id: "user4", game:"A", score: 555}
  {_id: "user7", game:"A", score: 53234}
]);
> db.coll.aggregate([
    {$sort: {score: -1}},
    {$group: {
        _id: "$game",
        leaderBoard: {$pushFirstN: [{_id: "$_id", score: "$score"}, 10]}
    }
])
{
  _id: "A",
  leaderBoard: [
    {_id: "user7", score: 53234},
    {_id: "user1", score: 2345},
    {_id: "user4", score: 555},
    {_id: "user3", score: 100}
  ]
}

Notes

  • This would be the first accumulator to take more than a single argument.

Errors

  • If the second argument is not a nonnegative integer. If the argument is zero, the result will be an empty array.

Previous Description:
Frequently asked by users:

Analogous to {$group:{_id:"$key", maxval:{$max:"$val"}}} if user needs to gather top N values per key (most recent, highest N, etc) to have ability to do equivalent of {$max:"$val",$limit:5} or $push:{$sort:...,$limit:N} type of idea.



 Comments   
Comment by Katya Kamenieva [ 06/Jan/22 ]

This will be included in the Atlas Rapid Release version 5.2.0 and will be available in the LTS version 6.0.

Comment by Asya Kamsky [ 02/Apr/20 ]

I just realized there is a workaround that can be used that depending on the number of keys being grouped could be much better than pushing everything into one array and then $slicing it.

Assuming a collection called scores which has an index on game:1, score:-1 and a typical documents looking like this:

{ "_id" : ObjectId("5e85f8912ec206363fe5868d"), "player" : "P0", "game" : "G900", "score" : 3477 }

To get a leaderboard for three specific games run this aggregation:

db.scores.aggregate([
     {$match:{game:{$in:["G0","G100","G500"]}}},
     {$group:{_id:"$game"}},
     {$lookup:{
          from:"scores", 
          as:"top5", 
          let:{g:"$_id"},
          pipeline:[ 
               {$match:{$expr:{$eq:["$game","$$g"]}}}, 
               {$sort:{score:-1}},
               {$limit:5}, 
               {$project:{_id:0, player:1, score:1}}
          ]
     }}
])
{ "_id" : "G0", "top5" : [ { "player" : "P17", "score" : 4146076 }, { "player" : "P17", "score" : 4146076 }, { "player" : "P17", "score" : 4146076 }, { "player" : "P8", "score" : 4133335 }, { "player" : "P8", "score" : 4133335 } ] }
{ "_id" : "G100", "top5" : [ { "player" : "P33", "score" : 4163087 }, { "player" : "P33", "score" : 4163087 }, { "player" : "P33", "score" : 4163087 }, { "player" : "P8", "score" : 4150937 }, { "player" : "P8", "score" : 4150937 } ] }
{ "_id" : "G500", "top5" : [ { "player" : "P39", "score" : 4166987 }, { "player" : "P39", "score" : 4166987 }, { "player" : "P39", "score" : 4166987 }, { "player" : "P18", "score" : 4156208 }, { "player" : "P18", "score" : 4156208 } ] }

To get the same for all the games just remove the $match

db.scores.aggregate({$group:{_id:"$game"}},{$lookup:{from:"scores", as:"top5", let:{g:"$_id"},pipeline:[ {$match:{$expr:{$eq:["$game","$$g"]}}}, {$sort:{score:-1}},{$limit:5}, {$project:{_id:0, player:1, score:1}} ]}})
{ "_id" : "G0", "top5" : [ { "player" : "P17", "score" : 4146076 }, { "player" : "P17", "score" : 4146076 }, { "player" : "P17", "score" : 4146076 }, { "player" : "P8", "score" : 4133335 }, { "player" : "P8", "score" : 4133335 } ] }
{ "_id" : "G100", "top5" : [ { "player" : "P33", "score" : 4163087 }, { "player" : "P33", "score" : 4163087 }, { "player" : "P33", "score" : 4163087 }, { "player" : "P8", "score" : 4150937 }, { "player" : "P8", "score" : 4150937 } ] }
{ "_id" : "G1000", "top5" : [ { "player" : "P283", "score" : 149991 }, { "player" : "P283", "score" : 149991 }, { "player" : "P283", "score" : 149991 }, { "player" : "P283", "score" : 149991 }, { "player" : "P431", "score" : 149947 } ] }
{ "_id" : "G1100", "top5" : [ { "player" : "P645", "score" : 149859 }, { "player" : "P645", "score" : 149859 }, { "player" : "P645", "score" : 149859 }, { "player" : "P645", "score" : 149859 }, { "player" : "P1338", "score" : 149805 } ] }
{ "_id" : "G1200", "top5" : [ { "player" : "P1310", "score" : 149958 }, { "player" : "P1310", "score" : 149958 }, { "player" : "P1310", "score" : 149958 }, { "player" : "P886", "score" : 149951 }, { "player" : "P886", "score" : 149951 } ] }
{ "_id" : "G1300", "top5" : [ { "player" : "P100", "score" : 149942 }, { "player" : "P100", "score" : 149942 }, { "player" : "P100", "score" : 149942 }, { "player" : "P100", "score" : 149942 }, { "player" : "P1241", "score" : 149856 } ] }
{ "_id" : "G1400", "top5" : [ { "player" : "P489", "score" : 149988 }, { "player" : "P489", "score" : 149988 }, { "player" : "P489", "score" : 149988 }, { "player" : "P489", "score" : 149988 }, { "player" : "P972", "score" : 149923 } ] }
{ "_id" : "G1500", "top5" : [ { "player" : "P303", "score" : 149927 }, { "player" : "P303", "score" : 149927 }, { "player" : "P303", "score" : 149927 }, { "player" : "P303", "score" : 149927 }, { "player" : "P128", "score" : 149872 } ] }
{ "_id" : "G1600", "top5" : [ { "player" : "P795", "score" : 149960 }, { "player" : "P795", "score" : 149960 }, { "player" : "P795", "score" : 149960 }, { "player" : "P766", "score" : 149695 }, { "player" : "P766", "score" : 149695 } ] }
{ "_id" : "G1700", "top5" : [ { "player" : "P107", "score" : 149981 }, { "player" : "P107", "score" : 149981 }, { "player" : "P107", "score" : 149981 }, { "player" : "P107", "score" : 149981 }, { "player" : "P1427", "score" : 149955 } ] }
{ "_id" : "G1800", "top5" : [ { "player" : "P774", "score" : 149886 }, { "player" : "P774", "score" : 149886 }, { "player" : "P774", "score" : 149886 }, { "player" : "P712", "score" : 149746 }, { "player" : "P712", "score" : 149746 } ] }
{ "_id" : "G1900", "top5" : [ { "player" : "P301", "score" : 149275 }, { "player" : "P301", "score" : 149275 }, { "player" : "P301", "score" : 149275 }, { "player" : "P301", "score" : 149275 }, { "player" : "P740", "score" : 149063 } ] }
{ "_id" : "G200", "top5" : [ { "player" : "P41", "score" : 4136307 }, { "player" : "P41", "score" : 4136307 }, { "player" : "P41", "score" : 4136307 }, { "player" : "P2", "score" : 4103652 }, { "player" : "P2", "score" : 4103652 } ] }
{ "_id" : "G2000", "top5" : [ { "player" : "P952", "score" : 149857 }, { "player" : "P952", "score" : 149857 }, { "player" : "P952", "score" : 149857 }, { "player" : "P1486", "score" : 149826 }, { "player" : "P1486", "score" : 149826 } ] }
{ "_id" : "G2100", "top5" : [ { "player" : "P977", "score" : 149936 }, { "player" : "P977", "score" : 149936 }, { "player" : "P977", "score" : 149936 }, { "player" : "P465", "score" : 149882 }, { "player" : "P465", "score" : 149882 } ] }
{ "_id" : "G2200", "top5" : [ { "player" : "P450", "score" : 149824 }, { "player" : "P450", "score" : 149824 }, { "player" : "P450", "score" : 149824 }, { "player" : "P450", "score" : 149824 }, { "player" : "P676", "score" : 149771 } ] }
{ "_id" : "G2300", "top5" : [ { "player" : "P444", "score" : 149826 }, { "player" : "P444", "score" : 149826 }, { "player" : "P444", "score" : 149826 }, { "player" : "P444", "score" : 149826 }, { "player" : "P173", "score" : 149805 } ] }
{ "_id" : "G2400", "top5" : [ { "player" : "P1006", "score" : 149885 }, { "player" : "P1006", "score" : 149885 }, { "player" : "P1006", "score" : 149885 }, { "player" : "P525", "score" : 149700 }, { "player" : "P525", "score" : 149700 } ] }
{ "_id" : "G2500", "top5" : [ { "player" : "P112", "score" : 149806 }, { "player" : "P112", "score" : 149806 }, { "player" : "P112", "score" : 149806 }, { "player" : "P112", "score" : 149806 }, { "player" : "P868", "score" : 149770 } ] }
{ "_id" : "G2600", "top5" : [ { "player" : "P1232", "score" : 149937 }, { "player" : "P1232", "score" : 149937 }, { "player" : "P1232", "score" : 149937 }, { "player" : "P977", "score" : 149690 }, { "player" : "P977", "score" : 149690 } ] }
...

In my test dataset which has 100 games, and 1.13 million total documents, the above aggregation for all games runs in 39ms, for a subset of games in single digits it runs in 2-3ms.

Comment by Asya Kamsky [ 31/Mar/20 ]

ashutosh.mimani@inspirock.com there are no obvious workaround that can do this in a single aggregation pipeline - I imagine this could be implemented using several queries/aggregations, but I haven't been able to come up with a general solution in a single pipeline.

Comment by Ashutosh Mimani [ 19/Mar/20 ]

I can't use $push followed by $slice as that hits 100mb memory limit. This accumulator would have been a great solution for the problem. Are there known workarounds to not having this? 

Comment by DANIELE Tassone [ 10/Mar/20 ]

Just come up on this issue today - I will really would be happy to have it.

Comment by Benoit Labergri [ 29/Jan/18 ]

I am also worried about performances.
Same question with million of reccords on sharded collection and If I need the full document content (1k data) of the 10 best score only. I assume it will do the full scan of the candidates and not only the 10 best , will it ?
Even if the query does only the scan of the 10 best per shard would be better than the scan of all matching the first stage.
The simplest would have been to do a lookup but it does not work with sharded collection...
For instance our solution for performance is do to a find with limit for each groups in parallel but we still have a hope that aggregate will improve.

Comment by Ashutosh Mittal [ 27/Jan/18 ]

but if the number of records is in millions, but i want only 10 records in each group is it gonna cause a problem?

Comment by Asya Kamsky [ 25/Jan/18 ]

ashuSvirus if the number of records of each grouping is not too high you can $sort and $push (without limit) and then use $slice expression in the next stage to trim the result to desired number of remaining (top) values.

Using example from description to get top 3 leaderboard per game:

> db.coll.insert([
  {_id: "user1", game:"A", score: 2345},
  {_id: "user3", game:"A", score: 100},
  {_id: "user4", game:"A", score: 555},
  {_id: "user7", game:"A", score: 53234}
]);
> db.coll.aggregate([
    {$sort: {game:1, score: -1}},
    {$group: {
        _id: "$game",
        leaderBoard: {$push: {_id: "$_id", score: "$score"}}
    }},
    {$addFields:{leaderBoard:{$slice:["$leaderBoard",0,3]}}}
])
{
  _id: "A",
  leaderBoard: [
    {_id: "user7", score: 53234},
    {_id: "user1", score: 2345},
    {_id: "user4", score: 555}
  ], {allowDiskUse:true}
}

While sorting on game isn't necessary, I'm making an assumption there may be an index to support that sort and it would also reduce resource usage during grouping.

Comment by Ashutosh Mittal [ 20/Jan/18 ]

is there any workaround for this one without performing n+1 loop?

Comment by Asya Kamsky [ 20/Dec/16 ]

We are still tracking this as a highly desired feature, but we have not yet decided what version it will be implemented in.

Comment by Sachin Takkar [ 13/Dec/16 ]

Please comment if there are any plans to do this in future.

Comment by Jagadish [ 15/May/15 ]

Are there any plans to do it? When will this feature be added?

Comment by Peter [ 30/Apr/14 ]

It's really helpful for that guys who transfer from Relationship-DBMS.
Although it's possible to achieve the goal by Map/Reduce. But as I kown, Map/Reduce in MonogDB has a bad efficiency,not like in CouchDB.

Comment by v s [ 14/Apr/14 ]

+1

Comment by Kamesh [ 05/Dec/13 ]

Any idea on, in which version this will be fixed?

Comment by Andy Ennamorato [ 25/Jul/13 ]

This would definitely be useful to have.

Comment by Flavien [ 24/Jul/13 ]

Is there an ETA for triaging this? Our planning depends on whether and when this gets implemented.

Comment by Flavien [ 24/Jul/13 ]

I would very much like that, right now I have to issue 60 different queries to get the top 10 of various groups, when only one would be needed if we had this feature.

Generated at Thu Feb 08 03:20:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.