[SERVER-23582] Query planner should not generate DISTINCT_SCAN plan for distinct on dotted field Created: 06/Apr/16  Updated: 17/May/16  Resolved: 17/May/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: J Rassi Assignee: David Storch
Resolution: Done Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-23561 Distinct on sub-documents array doesn... Closed
related to SERVER-2104 covered index should support dotted f... Closed
Operating System: ALL
Participants:

 Description   

The query planner currently generates DISTINCT_SCAN plans for distinct operations on dotted fields, if the distinct index is non-multikey. Due to SERVER-2104, distinct operations on dotted fields can never be covered, so DISTINCT_SCAN plans should never be generated for them.

See, the following explain output for such an example distinct operation. The query planner generates a full index scan plan with a FETCH node, which is much more costly than a collection scan plan.

> db.foo.drop()
true
> db.foo.ensureIndex({"a.b":1})
{
	"createdCollectionAutomatically" : true,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.foo.explain().distinct('a.b')
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.foo",
		"indexFilterSet" : false,
		"parsedQuery" : {
 
		},
		"winningPlan" : {
			"stage" : "PROJECTION",
			"transformBy" : {
				"_id" : 0,
				"a.b" : 1
			},
			"inputStage" : {
				"stage" : "FETCH",
				"inputStage" : {
					"stage" : "DISTINCT_SCAN",
					"keyPattern" : {
						"a.b" : 1
					},
					"indexName" : "a.b_1",
					"isMultiKey" : false,
					"isUnique" : false,
					"isSparse" : false,
					"isPartial" : false,
					"indexVersion" : 1,
					"direction" : "forward",
					"indexBounds" : {
						"a.b" : [
							"[MinKey, MaxKey]"
						]
					}
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "rassi",
		"port" : 27017,
		"version" : "3.3.4",
		"gitVersion" : "2ab9b1fd7ea20319e2d9ddb0532234729877703f"
	},
	"ok" : 1
}

Relatedly, the second paragraph of the method comment for the getDistinctNodeIndex() helper is incorrect:

 * Returns true if indices contains an index that can be used with DistinctNode (the "fast distinct
 * hack" node, which can be used only if there is an empty query predicate).  Sets indexOut to the
 * array index of PlannerParams::indices.  Look for the index for the fewest fields.  Criteria for
 * suitable index is that the index cannot be special (geo, hashed, text, ...), and the index cannot
 * be a partial index.
 *
 * Multikey indices are not suitable for DistinctNode when the projection is on an array element.
 * Arrays are flattened in a multikey index which makes it impossible for the distinct scan stage
 * (plan stage generated from DistinctNode) to select the requested element by array index.
 *
 * Multikey indices cannot be used for the fast distinct hack if the field is dotted.  Currently the
 * solution generated for the distinct hack includes a projection stage and the projection stage
 * cannot be covered with a dotted field.

The comment above does not correctly describe the behavior of distinct against an array. The semantics of distinct operations are such that arrays are flattened:

> db.foo.drop()
true
> db.foo.insert({a:1})
WriteResult({ "nInserted" : 1 })
> db.foo.insert({a:[2,3]})
WriteResult({ "nInserted" : 1 })
> db.foo.distinct('a')
[ 1, 2, 3 ]



 Comments   
Comment by David Storch [ 17/May/16 ]

The DISTINCT_SCAN => FETCH plan is desirable in the case that the set of distinct index keys has low cardinality. The DISTINCT_SCAN stage will skip over duplicated keys via an IndexCursor seek, which is often preferable to a COLLSCAN. Closing as Works as Designed.

Generated at Thu Feb 08 04:03:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.