[SERVER-37715] Use DISTINCT_SCAN for $unwind-$group pipelines Created: 23/Oct/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework, Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: David Storch Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: QFB, asya, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-68567 distinct command on the Time-Series c... Backlog
Related
related to SERVER-55112 Behaviour of distinct differs between... Closed
related to SERVER-27915 Make $group with $addToSet accumulato... Backlog
is related to SERVER-27494 Avoid unwind on multi-key index field... Backlog
is related to SERVER-9507 Optimize $sort+$group+$first pipeline... Closed
Assigned Teams:
Query Optimization
Sprint: Query Optimization 2021-05-17
Participants:

 Description   

This is an extension of SERVER-9507, which allowed $group to be executed using a DISTINCT_SCAN stage. A distinct command over a multikey field can DISTINCT_SCAN a multikey index:

> db.c.drop()
true
> db.c.createIndex({a: 1})
{
	"createdCollectionAutomatically" : true,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.c.insert({a: [1, 2, 3]})
WriteResult({ "nInserted" : 1 })
> db.c.insert({a: [2, 3, 4]})
WriteResult({ "nInserted" : 1 })
> db.c.explain().distinct("a")
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.c",
		"indexFilterSet" : false,
		"parsedQuery" : {
 
		},
		"queryHash" : "840B4B0E",
		"winningPlan" : {
			"stage" : "PROJECTION",
			"transformBy" : {
				"_id" : 0,
				"a" : 1
			},
			"inputStage" : {
				"stage" : "DISTINCT_SCAN",
				"keyPattern" : {
					"a" : 1
				},
				"indexName" : "a_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"a" : [
						"a"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"a" : [
						"[MinKey, MaxKey]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "storchbox",
		"port" : 27017,
		"version" : "0.0.0",
		"gitVersion" : "unknown"
	},
	"ok" : 1
}

When the same distinct operation is expressed using the aggregation framework, however, the $unwind inhibits the DISTINCT_SCAN optimization added in SERVER-9507:

> db.c.explain().aggregate([{$unwind: {path: "$a", preserveNullAndEmptyArrays: true}}, {$group: {_id: "$a"}}])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
 
				},
				"fields" : {
					"a" : 1,
					"_id" : 0
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.c",
					"indexFilterSet" : false,
					"parsedQuery" : {
 
					},
					"queryHash" : "8B3D4AB8",
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$unwind" : {
				"path" : "$a",
				"preserveNullAndEmptyArrays" : true
			}
		},
		{
			"$group" : {
				"_id" : "$a"
			}
		}
	],
	"ok" : 1
}

This task is to add logic in order to optimize $unwind-$group pipelines to use DISTINCT_SCAN when appropriate. It is closely related to SERVER-27494, which describes how $unwind can sometimes be optimized away even without generating a DISTINCT_SCAN plan.



 Comments   
Comment by Charlie Swanson [ 27/May/21 ]

Moving this back to the backlog user. We thought hana.pearlman might have time to take a look but other things came up. Hopefully someone else will free up shortly.

Generated at Thu Feb 08 04:46:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.