[SERVER-21022] Coalesce $group and $limit in aggregation pipeline Created: 20/Oct/15  Updated: 18/Aug/20  Resolved: 29/Mar/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.6.11
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Michael Korbakov Assignee: Benjamin Murphy
Resolution: Won't Fix Votes: 0
Labels: optimization
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-4507 aggregation: optimize $group to take... Backlog
Backwards Compatibility: Fully Compatible
Sprint: Query 10 (02/22/16), Query 11 (03/14/16), Query 12 (04/04/16)
Participants:

 Description   

Is it possible to add a coalescing optimization for a sequence of $group and $limit in an aggregation pipeline?
For example, I have a pipeline like this:

db.messages.aggregate([
    {$match: {"company_id" : ObjectId("4c2118ad54397f271b000000")}},
    {$sort: {"ct": -1}},
    {$group: {_id: "$ti", ts: {$first: "$ct"}}},
    {$limit: 10}
])

My goal is to get a small subset of $ti ("thread id") values that have the latest $ct ("conversation timestamp"). Execution time for this request is prohibitively high, and I guess that's because the $group stage processes all input documents and the limit is applied afterwards.

It seems that incorporating $limit into $group processing would benefit this query by avoiding computations whose results are thrown away anyway.
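
As a minimal check, the lack of coalescing shows up in the explain output; a sketch using the {explain: true} aggregate option (available since 2.6):

db.messages.aggregate([
    {$match: {"company_id": ObjectId("4c2118ad54397f271b000000")}},
    {$sort: {"ct": -1}},
    {$group: {_id: "$ti", ts: {$first: "$ct"}}},
    {$limit: 10}
], {explain: true})
// the reported plan keeps $group and $limit as separate stages;
// nothing about the $limit changes how $group accumulates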



 Comments   
Comment by yoch melka [ 03/Jan/20 ]

Thank you Asya.

I already voted for this ticket. My point here is to provide a realistic example that would greatly benefit from such an improvement. In fact, MongoDB currently performs far worse than MySQL/PostgreSQL on this type of case (pagination after grouping), which can be pretty common.

I noticed that sorting before grouping actually doesn't give any guarantee about the result order, as explained in SERVER-24799. But IMHO, if you are going to implement the SERVER-4507 proposal, it would be really beneficial to also support the case where the sort order comes not from an index but from a previous $sort stage in the pipeline, and for that you'll want to implement SERVER-24799 anyway.
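
For now, the only supported way to get a stable page out of $group seems to be an explicit $sort on the group key before paginating; a sketch against the pipeline from my comment below (n and k are placeholders):

db.collection.aggregate([
  { $sort: { productId: 1 } },
  { $group: { _id: '$productId', vendors: { $push: '$vendorId' } } },
  { $sort: { _id: 1 } },  // explicit sort: the only guaranteed post-$group order
  { $skip: n },
  { $limit: k }
])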

Comment by Asya Kamsky [ 02/Jan/20 ]

Your pipeline would benefit from SERVER-4507, please vote/watch that ticket.

Also note that a pre-$group sort on productId does not guarantee any particular order after $group, so using skip/limit on it without explicit sorting is not supported (it may work by coincidence now, but there is no such guarantee).

Comment by yoch melka [ 18/Dec/19 ]

Hi,

We have a use case pretty similar to mhertz's, and this is a real concern for us.

Our documents look like this (there is a unique index on {productId: 1, vendorId: 1}):

{
  "_id": <ObjectID>,
  "productId": <ObjectID>,
  "vendorId": <ObjectID>,
  "price": <decimal>,
  "quantity": <int>,
}

The aggregation pipeline is basically:

[
  { $sort: { productId: 1 } },
  { 
    $group: {
      _id: '$productId',
      vendors: {
        $push: '$vendorId'
      }
    }
  },
  { $skip: n },
  { $limit: k }
]

In fact, the general use case of "pagination after grouping" is currently problematic, because the $group stage needs to pass over the whole collection before the later stages apply, consuming much more time and memory than needed. (For instance, see this question on Stack Overflow.)

A similar problem also occurs with $lookup from other collections (e.g. when we look up from the products collection): it's not possible to paginate effectively. Is there a plan to make $lookup streamed?
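
In the meantime, for the $lookup case a workaround is to paginate before the join, so that only one page of groups is looked up; a sketch, assuming products are keyed by their _id:

db.collection.aggregate([
  { $sort: { productId: 1 } },
  { $group: { _id: '$productId', vendors: { $push: '$vendorId' } } },
  { $sort: { _id: 1 } },
  { $skip: n },
  { $limit: k },
  // only the k groups on this page reach the join
  { $lookup: { from: 'products', localField: '_id', foreignField: '_id', as: 'product' } }
])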

 

Comment by Asya Kamsky [ 04/Jul/19 ]

I think SERVER-4507 will fix/eliminate this issue, as a streaming $group will immediately start returning results, and as soon as $limit is satisfied it will be able to stop.

 

Comment by Asya Kamsky [ 01/Jul/19 ]

I meant that if you just want any 10 names, you can pick them arbitrarily (from whatever list, or from the distinct command) and then start the pipeline with

{$match:{name:{$in:[ list-of-10-names ]}}} 
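
Spelled out as a sketch (collection name is illustrative; distinct is one way to get the list):

// pick any 10 names up front, then let $match restrict the pipeline
var names = db.coll.distinct("name").slice(0, 10);
db.coll.aggregate([
  { $match: { name: { $in: names } } },
  { $sort: { name: 1, start_date: -1 } },
  { $group: { _id: "$name", first: { $first: "$_id" } } }
])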

Comment by Matthew Hertz [ 01/Jul/19 ]

"If you don't care, you could use $match first thing in the pipeline to only keep some 10 names."

Sometimes we do want a specific set of names, but a lot of the time we don't. I don't quite understand how $match helps in the case where we don't care which names are returned, though - could you elaborate on that? Obviously, if we know which names we want, we can just filter for those in the $match stage.

I'm very happy to see that change is coming in 4.1.4. That'll be a big help to us.

Comment by Asya Kamsky [ 30/Jun/19 ]

mhertz I have a couple of clarifying questions about your use case.

I presume you have a lot more than 10 names in your dataset, yes?  Do you care which ones are the 10 that you get to keep from $group?  If you don't care, you could use $match first thing in the pipeline to only keep some 10 names.

 

The second issue you mention is actually addressed in 4.2 via SERVER-9507, so when I run the pipeline you show I get a much more efficient plan:

db.coll.aggregate([
    { $sort: { "name": 1, "start_date": -1 } },
    { $group: { "_id": "$name", first: { "$first": "$_id" } } },
    { $limit: 5 }
])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
 
				},
				"sort" : {
					"name" : 1,
					"start_date" : -1
				},
				"fields" : {
					"name" : 1,
					"start_date" : 1,
					"_id" : 1
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.sort",
					"indexFilterSet" : false,
					"parsedQuery" : {
 
					},
					"queryHash" : "E94DC09D",
					"planCacheKey" : "E94DC09D",
					"winningPlan" : {
						"stage" : "FETCH",
						"inputStage" : {
							"stage" : "DISTINCT_SCAN",
							"keyPattern" : {
								"name" : 1,
								"start_date" : -1
							},
							"indexName" : "name_1_start_date_-1",
							"isMultiKey" : false,
							"multiKeyPaths" : {
								"name" : [ ],
								"start_date" : [ ]
							},
							"isUnique" : false,
							"isSparse" : false,
							"isPartial" : false,
							"indexVersion" : 2,
							"direction" : "forward",
							"indexBounds" : {
								"name" : [
									"[MinKey, MaxKey]"
								],
								"start_date" : [
									"[MaxKey, MinKey]"
								]
							}
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$groupByDistinctScan" : {
				"newRoot" : {
					"_id" : "$name",
					"first" : "$_id"
				}
			}
		},
		{
			"$limit" : NumberLong(10)
		}
	],
	"ok" : 1
}
// logs show
planSummary: DISTINCT_SCAN { name: 1, start_date: -1 } keysExamined:10 docsExamined:10
 

In my case I have 10 distinct names, so that's how many keys and documents the aggregation looks at.
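
For anyone reproducing this, the plan above depends on having an index that matches the $sort; a minimal sketch (collection name illustrative):

db.coll.createIndex({ name: 1, start_date: -1 })
db.coll.aggregate([
  { $sort: { name: 1, start_date: -1 } },
  { $group: { _id: "$name", first: { $first: "$_id" } } },
  { $limit: 5 }
], { explain: true })
// the winning plan should show DISTINCT_SCAN over name_1_start_date_-1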

Comment by Matthew Hertz [ 20/Jun/19 ]

I have a use case. Our application actually depends on being able to do this - and we can't! Guess we should have investigated this before settling on Mongo!

We have a number of collections all with documents of the following form:

{
  "_id": <ObjectID>,
  "name": "blah1",
  "start_date": T1,
  "payload": <embedded document>
}

We're effectively defining a number of time series, each individually addressable by name. We very often need to select only the latest document (i.e. the document with the greatest start_date per name) for a number of time series (the first n time series ordered by name is fine). What we want to do is this:

collection.aggregate(
  [
    {
      $sort: {"name": 1, "start_date": -1}  // there's an index on this, obviously!
    },
    {
      $group: {"_id": "$name", "first": {"$first": "$_id"}}
    },
    {
      $limit: 10
    }
  ]
)

In this case Mongo calculates all the groups, looking at every document in the entire collection, despite my stating up front that I only want 10. That's clearly more work than it has to do given the query, and it makes me very sad! There's another issue (I suspect, anyway): even for the first 10 groups, Mongo inspects every document/key in each group, despite all the data in the output group document being contained entirely in the first document of the group. There should be no need to look at further documents/keys per group!

Regardless of the latter issue, the first one is, to me, a clear use case for coalescing $group and $limit.

Comment by Benjamin Murphy [ 29/Mar/16 ]

After attempting to implement this feature, we ran into some unexpected issues related to spilling to disk. We have also been unable to come up with a reasonable use case for $group followed immediately by $limit. As a result, I'm closing this ticket as "Won't Fix", since the potential optimization does not seem to outweigh the code complexity.

Comment by Charlie Swanson [ 26/Feb/16 ]

rmihael@gmail.com it's not likely. It's a pretty invasive change, and we are generally very hesitant to backport any non-trivial changes to our stable branches unless they fix critical bugs.

Comment by Michael Korbakov [ 25/Feb/16 ]

Are there any chances for this improvement to be backported to 2.6 branch?

Comment by Charlie Swanson [ 21/Oct/15 ]

redbeard0531, what are our current guarantees/behavior for the ordering of documents after a $group stage? The docs just say that it does not order its output documents. Does it return them in any deterministic order? Is that an order we want to guarantee to the user? If not, then a $limit after a $group seems like it could return any subset (of the correct size) of the grouped documents.

Either way, this optimization makes sense to me, and I would propose moving it into Planning Bucket A, for consideration in the 3.3 Development cycle.
