[SERVER-27115] Track fields renamed by $project in aggregation for index consideration Created: 18/Nov/16  Updated: 12/Dec/22  Resolved: 30/May/17

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Querying
Affects Version/s: None
Fix Version/s: 3.5.8

Type: Bug Priority: Major - P3
Reporter: Chris Harris Assignee: David Storch
Resolution: Done Votes: 0
Labels: bi-performance, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-31729 Views: index on underlying collection... Closed
Problem/Incident
causes SERVER-31696 Aggregation's renamed fields analysis... Closed
Related
related to SERVER-72037 Allow $replace(With|Root) to particip... Open
related to SERVER-5477 when sharded, no need to merge groups... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

> db.version()
3.4.0-rc1
> c.createIndex({x:1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> c.explain().aggregate([{$project:{x_renamed:"$x"}},{$match:{"x_renamed" : 1}}])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					
				},
				"fields" : {
					"x" : 1,
					"_id" : 1
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.col",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$project" : {
				"_id" : true,
				"x_renamed" : "$x"
			}
		},
		{
			"$match" : {
				"x_renamed" : 1
			}
		}
	],
	"ok" : 1
}
>
>//Compared to $matching on the original field name (after SERVER-19153)
>
> c.explain().aggregate([{$project:{x:1}},{$match:{"x" : 1}}])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					"x" : 1
				},
				"fields" : {
					"x" : 1,
					"_id" : 1
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.col",
					"indexFilterSet" : false,
					"parsedQuery" : {
						"x" : {
							"$eq" : 1
						}
					},
					"winningPlan" : {
						"stage" : "FETCH",
						"inputStage" : {
							"stage" : "IXSCAN",
							"keyPattern" : {
								"x" : 1
							},
							"indexName" : "x_1",
							"isMultiKey" : false,
							"multiKeyPaths" : {
								"x" : [ ]
							},
							"isUnique" : false,
							"isSparse" : false,
							"isPartial" : false,
							"indexVersion" : 2,
							"direction" : "forward",
							"indexBounds" : {
								"x" : [
									"[1.0, 1.0]"
								]
							}
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$project" : {
				"_id" : true,
				"x" : true
			}
		}
	],
	"ok" : 1
}

Sprint: Query 2017-03-27, Query 2017-04-17, Query 2017-05-08, Query 2017-05-29, Query 2017-06-19
Participants:
Case:
Linked BF Score: 0

 Description   

Simply renaming a field in a $project currently disqualifies it from index consideration. The database should track simple field name changes in order to preserve index options.

This is particularly relevant since SERVER-19153 has been implemented and views are being introduced. In addition to BI use cases, consider that views would allow for more compact field names be stored in the collection and then expanded out via the view definition (eg: fn -> firstname).

Implementation of this feature would allow for more flexibility in how logically equivalent aggregation pipelines are written while optimal performance is maintained.



 Comments   
Comment by Githook User [ 30/May/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-27115 Allow $match to swap across renames expressed using $map.
Branch: master
https://github.com/mongodb/mongo/commit/a702053750a5a17071e539fc29910809881ad7c7

Comment by David Storch [ 23/May/17 ]

asya, I've found another example in which swapping is incorrect when the renamed-from path is dotted. This time the match expression which causes the problem is simply testing equality to a scalar---no $elemMatch or comparison to an array required:

> db.c.insert({a: [{b: [{c: 1}, {c: 2}]}, {b: [{c: 3}, {c: 4}]}]});
> db.c.aggregate([{$project: {z: "$a.b.c"}}, {$match: {z: 1}}]);
// No results.
> db.c.aggregate([{$match: {"a.b.c": 1}}, {$project: {z: "$a.b.c"}}]);
{ "_id" : ObjectId("5924b61782a3274a93284c12"), "z" : [ [ 1, 2 ], [ 3, 4 ] ] }

In this case, the swap causes us to return additional results rather than causing us to miss results. I don't think it's safe to perform the swap when the renamed-from path is dotted unless we could somehow know that there are no arrays along that path.

Comment by Githook User [ 19/May/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-27115 Allow $type on a renamed field to swap before the rename.
Branch: master
https://github.com/mongodb/mongo/commit/2aaa0eafa5f1c6e1c43c1f42fcf7975722c3fbfe

Comment by David Storch [ 20/Apr/17 ]

Remaining tasks:

  • Investigate getting $type to participate in $match swapping (possibly by making it inherit from LeafMatchExpression).
  • Add support for swapping dotted field paths in the cases that it is correct to do so.
  • Investigate adding support for array renames that are accomplished using $map.
Comment by Githook User [ 18/Apr/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-27115 extend $match swapping optimization to handle renamed fields

If a field renamed by $project or $addFields is used in a
subsequent $match, we can now swap the $match and update it
to use the original (or, "renamed from") field name. This
allows $match planning to result in better index usage in
some cases.
Branch: master
https://github.com/mongodb/mongo/commit/b9f9195390142f4e1363dc83b986ead5dc8993b8

Comment by David Storch [ 13/Apr/17 ]

asya, got it. I think it should be possible to implement this optimization in some cases for renames of dotted paths, with some additional checks related to whether or not the $match involves arrays. May current plan is to first implement the optimization without support for dotted paths, and then extend to allow dotted paths in certain cases.

Comment by Asya Kamsky [ 12/Apr/17 ]

db.c.aggregate([{$project: {_id: 0, "c.d": "$a.b"}}, {$match: {"c.d": [1, 2]}}]);

This is matching a literal array. so I guess dotted notation and any sort of array matching is not compatible.

My concern is that a common pattern is renaming dotted fields to remove the dot for compatibility with third party tools - i.e.

db.c.aggregate({$project:{"a_b":"$a.b", "a_c":"$a.c", "a_d":"$a.d"}},{$match:{"a_b":{$gt:5}}})

Comment by Githook User [ 12/Apr/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-27115 refactor tests for local pipeline optimization to use TEST macro
Branch: master
https://github.com/mongodb/mongo/commit/265a51a2ab7aca79b1d5286c546a46bab0154756

Comment by David Storch [ 11/Apr/17 ]

asya, good question. It is true that $elemMatch predicates have never been able to participate in our $match splitting/swapping optimizations. However, I think this applies to dotted paths regardless of whether or not there is an $elemMatch predicate. This example, similar to the one above, should show why:

> db.c.aggregate([{$project: {_id: 0, "c.d": "$a.b"}}, {$match: {"c.d": [1, 2]}}]);
{ "c" : { "d" : [ 1, 2 ] } }
> db.c.aggregate([{$match: {"a.b": [1, 2]}}, {$project: {_id: 0, "c.d": "$a.b"}}]);
// No results.

Comment by Asya Kamsky [ 11/Apr/17 ]

Isn't it combination of arrays and $elemMatch in $match that's not optimizable? Wouldn't non-array matching operators work as expected?

Comment by David Storch [ 11/Apr/17 ]

This optimization is not correct when the field path being renamed is dotted. Consider the following collection:

db.c.drop();
db.c.insert({a: [{b: 1}, {b: 2}]});

Suppose we have a pipeline which "renames" the path "a.b" to "c.d". The semantics of the $project stage for arrays mean that this is not merely a rename, but rather a reshaping of the document:

> db.c.aggregate([{$project: {_id: 0, "c.d": "$a.b"}}]);
{ "c" : { "d" : [ 1, 2 ] } }

Now consider the same pipeline, where the $project is followed by a $match on the newly created path "c.d":

> db.c.aggregate([{$project: {_id: 0, "c.d": "$a.b"}}, {$match: {"c.d": {$elemMatch: {$eq: 2}}}}]);
{ "c" : { "d" : [ 1, 2 ] } }

On the surface, it may look correct to rewrite this to a $match followed by a $project, where the $match is rewritten to be on path "a.b" instead of path "c.d". However, this would be an error, since it would cause us to miss the matching document:

> db.c.aggregate([{$match: {"a.b": {$elemMatch: {$eq: 2}}}}, {$project: {_id: 0, "c.d": "$a.b"}}]);
// No results!

This limitation is due to the presence of arrays along the renamed path. If there are no arrays along the renamed path, then I believe the optimization is always valid. However, aggregation cannot know a priori which field paths may contain arrays. Although SERVER-15086 added tracking of which indexed paths contain arrays, this information cannot be easily surfaced at the aggregation layer. Even if this information could be surfaced, it is currently tracked only for indexed paths, so the limitation on dotted paths would still apply in many cases.

Generated at Thu Feb 08 04:14:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.