[SERVER-24799] $group aggregation command should maintain document order Created: 24/Jun/16  Updated: 06/Dec/22  Resolved: 17/Jan/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Billy Tetrud Assignee: Backlog - Query Team (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-41891 $group on sorted cursor forces full c... Closed
is duplicated by SERVER-43029 sort before group, without changing k... Closed
is duplicated by SERVER-45333 Why we can not push not push the data... Closed
Assigned Teams:
Query
Backwards Compatibility: Fully Compatible
Participants:

 Description   

I just wasted a ton of time before realizing that $group doesn't group by the order of input documents. This is incredibly counter-intuitive and makes the already-complicated aggregation more complex to understand. I don't see why $group shouldn't guarantee input order, and it would be very helpful if it did.



 Comments   
Comment by Wojciech Waga [ 18/Oct/21 ]

@Asya Kamsky

I just went into the same issue and although being intuitive or not is subjective we are running into objective problem as well - by ignoring data being ordered by the same key (with proper index and sort on that key) group becomes a non-streaming op which makes it subject to 100MB limit and gives worse experience and perrofmance in some cases.

Comment by Asya Kamsky [ 09/Jul/20 ]

Adding a note here since this ticket was linked as example of something that's a misunderstanding:

$group always honors the order the documents are arriving into the $group stage, so if you $sort and then $group with $first, $last, $push etc then the order very much matters and will make a difference to each document within $group.

What is never guaranteed (at least at this point) is that after the group stage the entire stream of documents would be ordered by "something". So if you want it ordered by group _id field, you have to sort after $group. If you want it sorted by some other field, you have to $sort by that after $group.

Comment by Asya Kamsky [ 17/Jan/18 ]

We haven't been able to determine a use case that requires new functionality so I'm closing this as Won't Fix.

If you think that's a mistake, please provide an example with input documents, pipeline and desired output so we can see how the requested behavior would work.

Comment by Asya Kamsky [ 24/Oct/16 ]

uji

An array can be unwound emitting index of each element along with value.

Starting with 3.4 arrays can also be $zip'ed together with a generated sequence.

It's not really clear to me why the "roads" example needs $group to maintain document order as you didn't describe what objective $group is fulfilling.

This type of MongoDB-related support discussion would be best posted on the mongodb-user group.

Asya

Comment by Andrea Di Cesare [ 24/Oct/16 ]

I have a use case where maintaining the order of inputs document would make a big difference

Suppose the collection "roads" whose elements refer to the elements of the collection "points". It is a 1:N relationship implemented via the array property "pointIds" on the collection "roads". This because the points of the roads are ordered (from the beginning to the end of the road).

Just to clarify, the ordering of the "points" is purely maintained by the array, i.e. there are no properties of "points" that allow to deduct their order.

An aggregation pipeline on roads that uses $lookup to retrieve the list of points cannot use $group because it would lose the information about the ordering.

Comment by Billy Tetrud [ 27/Sep/16 ]

Asya, that's super helpful info. Thanks!

Comment by Asya Kamsky [ 27/Sep/16 ]

fresheneesz you can do this in upcoming 3.4 without any unwind and group using the new $reduce expression. This could be the $project stage to extract the field by which you want to sort via $sort:{"sortDate":1}:

{$project:{
    history:1, 
    sortDate: { $reduce: { 
                           input:"$history", 
                           initialValue:0, 
                           in: { $cond: { 
                                      if: {$eq:["$$this.field","b"]},
                                      then:  {$cond:{ 
                                              if:{$gt:["$$this.date","$$value"]},
                                              then:"$$this.date",
                                              else:"$$value"}},
                                      else:"$$value"
                           }}
    }}
}}

In your sample document it creates new field "sortDate" which is the largest "date" value in the array element where field is "b".

You can do it in the current 3.2 version via $filter and $max operators in $project stage as well and then sort by "sortDate.date":

{$project:{
    history:1, 
    sortDate:{ $max: {$filter: {
                                     input:"$history", 
                                     as:"this", 
                                     cond:{$eq:["$$this.field","b"]}
    }}}
}}

Comment by Billy Tetrud [ 05/Jul/16 ]

Is that discussion somewhere I can view? Why wouldn't my requested behavior make sense?

"if your application requires some ordering of documents, you would specify that somehow, e.g. via a $sort"

Yes, and if you $sort something then use a $group command, the sort is invalidated. This doesn't make sense, as the output of all the other commands are respected in the aggregation pipeline.

"Could you accomplish the ordering you need by adding a $sort stage after the $group?"

No. Here's my use case. I have documents with nested object representing its history like this:

var document = {
  history: [
    {field:'a', date:142983472},
    {field:'b', date:142983473},
    {field:'c', date:142983474},
    {field:'b', date:142983475},
    {field:'b', date:142983476},
  ]
}

And I want to sort the documents by the last modified date of a certain field. So lets say that field is b, I want to sort by the date 142983476. In order to do this, I need to unwind the history, project it in a certain way to expose only the dates I want, then sort by those date, then group by _id. But if grouping doesn't respect sort order, this won't work.

As a work around, I needed to build a much more complicated (and expensive) first projection stage, then after grouping I need an extra projection stage, then I can sort as the last stage. This was much harder to reason about and is more expensive to run.

Comment by Charlie Swanson [ 05/Jul/16 ]

fresheneesz, after some discussion, we're not sure the notion of ordering by the sequence of input documents makes much sense. Generally, if your application requires some ordering of documents, you would specify that somehow, e.g. via a $sort. Can you elaborate on exactly why this behavior would be useful? Could you accomplish the ordering you need by adding a $sort stage after the $group?

I'll note that this may be related to SERVER-4507, which will in some cases allow the $group stage to take advantage of sorted sequences, preserving the order.

Comment by Billy Tetrud [ 28/Jun/16 ]

I suppose I've been working with Javascript too long and assumed that mongo mapping would work like Javascript mapping. But of course MongoDB is written in C.. Anyways, yes that's exactly what I mean. But the overhead should be pretty insignificant I would think.

Comment by Charlie Swanson [ 27/Jun/16 ]

Hi fresheneesz,

Just to clarify, you're looking for the order of output documents to match the order of input documents? The reason the order is not preserved is that we keep track of all groups with a map from the _id to the accumulated value, so that we can quickly look up which group an incoming document belongs to.

For exemple, with the group stage

{$group: {_id: {$toLower: "$x"}, count: {$sum: 1}}}

And inputs given in this order:

{x: "foo"}
{x: "bar"}
{x: "BAR"}
{x: "FOO"}

You would expect the following output?

{_id: "foo", count: 2}
{_id: "bar", count: 2}

This is definitely possible, but will require some extra overhead to keep track of the order of the groups.

Generated at Thu Feb 08 04:07:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.