[SERVER-29444] Use a covered and streaming plan for $group, $sum:1 queries Created: 04/Jun/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Nicolas Dascanio Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 11
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-69257 COUNT_SCAN plan not selected when unf... Blocked
related to SERVER-4507 aggregation: optimize $group to take... Backlog
is related to SERVER-57518 16% performance loss switching from C... Backlog
is related to SERVER-20066 Query planner should consider index s... Closed
Assigned Teams:
Query Optimization
Sprint: Query 2019-07-29
Participants:
Case:

 Description   

A very common use of an aggregation is to count elements by group:

db.items.aggregate([ {$group: {_id: "$category", count: {$sum: 1}}}])

If there is an index over category, this pipeline could be covered and no document retrieval is necessary.

Since this is a quite common pattern, you may take it into account and make use of indexes.



 Comments   
Comment by David Storch [ 06/Aug/19 ]

After reviewing this again, I agree that this is not quite a duplicate of an existing ticket. It's closely related to SERVER-4507 (streaming $group) and SERVER-20066 (which, if enabled by default, would allow this query to use a whole covered index scan rather than a COLLSCAN). The request is to ensure that streaming $group plans can be covered by an index. This improvement may or may not fall out of the natural implementation for SERVER-4507.

Comment by Nicolas Dascanio [ 24/Jun/17 ]

I understand that when you have both the index and the collection in memory, a full scan on the index or on the collection should be similar. However, indexes should always be "hot" or fit entirely in memory, while it's not always possible to have an entire collection in memory.

You may have several collections and all it's indexes in memory. While one particular collection may fit in memory, it's unlikely that all of them could be in memory. Obviously, this has a lot to do with the size of the collections.

Thanks for the interest and responses

Comment by Asya Kamsky [ 24/Jun/17 ]

stoma to add to what's been said already, in your example, if there is an index over category, this pipeline could be covered and no document retrieval is necessary, however, our testing has consistently shown that unless you have huge documents or the collection is cold (not in memory) and the index is hot (in memory), the covered index scan for the pipeline you provide is not faster than collection scan.

You can see some of this discussed in SERVER-23406 which tracks ways we may be able to improve index scans.

In addition, we are tracking work in SERVER-4507 for $group to be able to take advantage of sorted sequences (if group key is sorted as when it's coming from an index). Again, preliminary testing seemed to indicate that for the most common case (like yours) the speedup is only on the order of 10% or so, which is why that work wasn't prioritized higher.

I believe your request is really for $group to be faster in simple pipelines, like the one you list, rather than just merely to use an index.

Best,
Asya Kamsky

Comment by Kyle Suarez [ 20/Jun/17 ]

SERVER-20066 has been completed; however, it will not be enabled by default, as it may cause performance regressions for certain workloads. Therefore, I'm going to re-open this ticket and send it to the Query Team for triage so it can be considered separately in the next round of planning.

Comment by Kyle Suarez [ 05/Jun/17 ]

david.storch – my bad, I indeed ran that test on my development branch for SERVER-20066

stoma, we are currently working on SERVER-20066. That ticket is a general query planner improvement that will allow for covered whole index scans even when there is no query predicate (or for aggregation, even when there is no $match stage). When that ticket is resolved, your aggregation will then utilize a covered IXSCAN plan like in my earlier comment. As such, I'm going to close this ticket – please feel free to follow SERVER-20066 for updates.

Regards,
Kyle

Comment by Nicolas Dascanio [ 05/Jun/17 ]

Kyle Suares, yes, it's a typo. Feel free to edit the description:

db.items.aggregate([ {$group: {_id: "$category", count: {$sum: 1}}}])

Comment by David Storch [ 05/Jun/17 ]

kyle.suarez, I'm not observing the same thing on master:

> db.c.explain().aggregate([{$group: {_id: "$aggregate", count: {$sum: 1}}}])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
 
				},
				"fields" : {
					"aggregate" : 1,
					"_id" : 0
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.c",
					"indexFilterSet" : false,
					"parsedQuery" : {
 
					},
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$group" : {
				"_id" : "$aggregate",
				"count" : {
					"$sum" : {
						"$const" : 1
					}
				}
			}
		}
	],
	"ok" : 1
}

But won't this be fixed once your changes for SERVER-20066 hit master?

Comment by Kyle Suarez [ 05/Jun/17 ]

SERVER-22903 only kicks in when the pipeline has no dependency. Here, the pipeline has a dependency on "category". But we already get a covered IXSCAN for this on 3.5.x; I just don't know what ticket is responsible for it.

> db.items.createIndex({category: 1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 2,
	"note" : "all indexes already exist",
	"ok" : 1
}
> db.items.explain().aggregate({: {_id: "", count: {: 1}}})
{
	"stages" : [
		{
			"" : {
				"query" : {
					
				},
				"fields" : {
					"category" : 1,
					"_id" : 0
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.items",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"winningPlan" : {
						"stage" : "PROJECTION",
						"transformBy" : {
							"category" : 1,
							"_id" : 0
						},
						"inputStage" : {
							"stage" : "IXSCAN",
							"keyPattern" : {
								"category" : 1
							},
							"indexName" : "category_1",
							"isMultiKey" : false,
							"multiKeyPaths" : {
								"category" : [ ]
							},
							"isUnique" : false,
							"isSparse" : false,
							"isPartial" : false,
							"indexVersion" : 2,
							"direction" : "forward",
							"indexBounds" : {
								"category" : [
									"[MinKey, MaxKey]"
								]
							}
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"" : {
				"_id" : "",
				"count" : {
					"" : {
						"" : 1
					}
				}
			}
		}
	],
	"ok" : 1
}

Note that the user is missing the "$" in front of category, but I assume this is just a small typo in the description of this ticket.

Comment by Kelsey Schubert [ 05/Jun/17 ]

Pretty sure we do this already in 3.4: SERVER-22093.

Generated at Thu Feb 08 04:20:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.