[SERVER-13703] Presence of extraneous $project causes a less efficient plan to be used Created: 23/Apr/14  Updated: 13/Nov/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.6.0
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Asya Kamsky Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 2
Labels: optimization, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-14159 Aggregation framework performances dr... Closed
is duplicated by SERVER-49306 Optimization for mid-pipeline $projec... Closed
is duplicated by SERVER-82836 UNPACK_TS_BUCKET stage includes field... Closed
Related
related to SERVER-31082 when $count is at the end of multiple... Backlog
related to SERVER-55886 Optimize away unused computed fields Backlog
related to SERVER-25120 aggregation requests generated field ... Closed
Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Sprint: Query 2017-03-27, Query 2017-04-17, QO 2022-10-03

 Description   

Even though the user explicitly tells us to include specific fields, we can see that they are not actually used by any later stage, so doesn't it make sense to optimize the $project away, the same as if it weren't there?

db.t1.aggregate([ {$match:{username:/^user8/}}, {$project:{username:1}},{$group:{_id:1,count:{$sum:1}}}],{explain:true})
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					"username" : /^user8/
				},
				"fields" : {
					"username" : 1,
					"_id" : 1
				},
				"plan" : {
					"cursor" : "BtreeCursor username_1",
					"isMultiKey" : false,
					"scanAndOrder" : false,
					"indexBounds" : {
						"username" : [
							[
								"user8",
								"user9"
							],
							[
								/^user8/,
								/^user8/
							]
						]
					},
					"allPlans" : [
						{
							"cursor" : "BtreeCursor username_1",
							"isMultiKey" : false,
							"scanAndOrder" : false,
							"indexBounds" : {
								"username" : [
									[
										"user8",
										"user9"
									],
									[
										/^user8/,
										/^user8/
									]
								]
							}
						}
					]
				}
			}
		},
		{
			"$project" : {
				"username" : true
			}
		},
		{
			"$group" : {
				"_id" : {
					"$const" : 1
				},
				"count" : {
					"$sum" : {
						"$const" : 1
					}
				}
			}
		}
	],
	"ok" : 1
}

Without the (needless) $project:

db.t1.aggregate([ {$match:{username:/^user8/}}, {$group:{_id:1,count:{$sum:1}}}],{explain:true})
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					"username" : /^user8/
				},
				"fields" : {
					"_id" : 0,
					"$noFieldsNeeded" : 1
				},
				"plan" : {
					"cursor" : "BtreeCursor username_1",
					"isMultiKey" : false,
					"scanAndOrder" : false,
					"indexBounds" : {
						"username" : [
							[
								"user8",
								"user9"
							],
							[
								/^user8/,
								/^user8/
							]
						]
					},
					"allPlans" : [
						{
							"cursor" : "BtreeCursor username_1",
							"isMultiKey" : false,
							"scanAndOrder" : false,
							"indexBounds" : {
								"username" : [
									[
										"user8",
										"user9"
									],
									[
										/^user8/,
										/^user8/
									]
								]
							}
						}
					]
				}
			}
		},
		{
			"$group" : {
				"_id" : {
					"$const" : 1
				},
				"count" : {
					"$sum" : {
						"$const" : 1
					}
				}
			}
		}
	],
	"ok" : 1
}



 Comments   
Comment by David Storch [ 05/Oct/22 ]

I'm re-opening this ticket. The fix was implemented in the new query optimizer that is under development by the Query Optimization team right now. We expect this issue to be fixed as part of the release of the new optimizer. However, the optimizer is not currently scheduled for release in 6.2 (its development is happening behind a feature flag), so we should not have this ticket marked as fixed with a 6.2 fixVersion. We are currently discussing as a team how best to represent this situation in Jira.

Comment by Githook User [ 03/Oct/22 ]

Author:

{'name': 'Naama Bareket', 'email': 'naama.bareket@mongodb.com', 'username': 'naama-bareket'}

Message: SERVER-13703: Presence of extraneous $project causes a less efficient plan to be used
Branch: master
https://github.com/mongodb/mongo/commit/246747af9c0896816d73dc439b39ba606e3f4b6f

Comment by Maxime Beaudry [ 29/Aug/14 ]

I agree. I have opened SERVER-15099. Still, thanks a lot for your help!

Comment by Asya Kamsky [ 29/Aug/14 ]

Maxime - your aggregation is slow because you are unwinding a large array.

I don't believe there is any relationship to this server ticket.

You might want to open a separate enhancement request for optimization of large unwind operations. I'm not sure how much can be done about it, but it wouldn't be part of this ticket.

Comment by Maxime Beaudry [ 27/Aug/14 ]

I just created a backup of my database by running mongodump.exe on my bigger machine (the one with lots of memory). Here is the output of the command:

d:\MongoDb\bin>mongodump.exe --db MongoProjectPerformanceProblem --out d:\MongoDb\PerfProblem.direct.backup --dbpath d:\MongoDb\Data
2014-08-27T06:23:07.916-0400 DATABASE: MongoProjectPerformanceProblem    to     d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem
2014-08-27T06:23:08.181-0400 [tools] query MongoProjectPerformanceProblem.system.indexes planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros)
 W:50769 r:199372 nreturned:1 reslen:110 108ms
2014-08-27T06:23:08.181-0400    MongoProjectPerformanceProblem.system.indexes to d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem\system.indexes.bson
2014-08-27T06:23:08.181-0400             1 documents
2014-08-27T06:23:08.181-0400    MongoProjectPerformanceProblem.Demo to d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem\Demo.bson
2014-08-27T06:23:09.834-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:11.004-0400 [tools]            Collection File Writing Progress: 3900/10000    39% (documents)
2014-08-27T06:23:11.114-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:12.486-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:13.875-0400 [tools] warning Listener::getElapsedTimeMillis returning 0ms
2014-08-27T06:23:14.000-0400 [tools]            Collection File Writing Progress: 8300/10000    83% (documents)
2014-08-27T06:23:15.029-0400             10000 documents
2014-08-27T06:23:15.513-0400    Metadata for MongoProjectPerformanceProblem.Demo to d:\MongoDb\PerfProblem.direct.backup\MongoProjectPerformanceProblem\Demo.metadata.json
2014-08-27T06:23:15.528-0400 [tools] dbexit:
2014-08-27T06:23:15.528-0400 [tools] shutdown: going to close listening sockets...
2014-08-27T06:23:15.528-0400 [tools] shutdown: going to flush diaglog...
2014-08-27T06:23:15.528-0400 [tools] shutdown: going to close sockets...
2014-08-27T06:23:15.528-0400 [tools] shutdown: waiting for fs preallocator...
2014-08-27T06:23:15.528-0400 [tools] shutdown: closing all files...
2014-08-27T06:23:15.981-0400 [tools] closeAllFiles() finished
2014-08-27T06:23:15.981-0400 [tools] shutdown: removing fs lock...
2014-08-27T06:23:15.981-0400 [tools] dbexit: really exiting now

As you can see, reading the complete database and writing it to disk takes about 8 seconds. Considering that my aggregation query that used the $unwind took 36s, it is hard to understand how writing to disk (in my opinion one of the slowest things a computer can do) can be 4.5x faster than running a query on data that is already loaded in memory. Considering that my collection has 10000 documents that each have 10 elements within 'ChildrenLevel1', each of my top-level documents takes 36s / 10000 = 3.6ms to process. In my opinion, this is very long since the data is already loaded in RAM. Furthermore, since we saw while doing the backup that the disks are quite fast, the CPU must be very busy doing something. We are CPU bound, not IO bound.

Please don't take these stats or opinions as criticism. I am just trying to lay out the facts so that we can find a solution. Once again, I really want to find a solution to this issue. It is key in our strategy to kick SQL Server out.

Now back to the subject of the backup. I am now compressing the backup and it will be about 1 GB. If you prefer, I can provide you with a small command-line executable (Windows) that can be used to create the database in a few minutes.
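
Alternatively, here is a minimal shell sketch that builds a similarly shaped test collection (the document shape is inferred from the queries in this thread; the field values are arbitrary):

// Build 10000 documents, each with a top-level DateTime and 10
// 'ChildrenLevel1' elements carrying their own DateTime.
for (var i = 0; i < 10000; i++) {
	var children = [];
	for (var j = 0; j < 10; j++) {
		children.push({DateTime : new Date(2014, 0, 1, 0, 0, i % 60, j)});
	}
	db.Demo.insert({_id : i, DateTime : new Date(2014, 0, 1), ChildrenLevel1 : children});
}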

Comment by Maxime Beaudry [ 26/Aug/14 ]

I tried without the $project on my "real" database, which has a significant amount of data. The db is 5 GB, which is still quite small in my opinion. As a first step, I tried to get the max value of a field in the top-level document:

db.Demo.aggregate([
	 {$match : {_id : {$lte : 1000}}},
	 {$group: {
	           _id : 0, 
	           max : {$max : "$DateTime"}
	          }
	 }
	])

This takes virtually 0ms. It is very fast. I don't even see an entry in the Mongo log file for this query.

I now go a step further and try to find the max value of a field within an array:

  db.Demo.aggregate([
	 {$match : {_id : {$lte : 1000}}},
	 {$unwind: "$ChildrenLevel1"},
	 {$group: {
	           _id : 0, 
	           max : {$max : "$ChildrenLevel1.DateTime"}
	          }
	 }
	])

This takes 3.2 seconds. There is therefore a huge impact in unwinding the array, even when I don't use the $project.

FYI, here is the output of 'explain':

{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
                                        "_id" : {
                                                "$lte" : 1000
                                        }
                                },
                                "fields" : {
                                        "ChildrenLevel1" : 1,
                                        "_id" : 0
                                },
                                "plan" : {
                                        "cursor" : "BtreeCursor _id_",
                                        "isMultiKey" : false,
                                        "scanAndOrder" : false,
                                        "indexBounds" : {
                                                "_id" : [
                                                        [
                                                                -Infinity,
                                                                1000
                                                        ]
                                                ]
                                        },
                                        "allPlans" : [
                                                {
                                                        "cursor" : "BtreeCursor _id_",
                                                        "isMultiKey" : false,
                                                        "scanAndOrder" : false,
                                                        "indexBounds" : {
                                                                "_id" : [
                                                                        [
                                                                                -Infinity,
                                                                                1000
                                                                        ]
                                                                ]
                                                        }
                                                }
                                        ]
                                }
                        }
                },
                {
                        "$unwind" : "$ChildrenLevel1"
                },
                {
                        "$group" : {
                                "_id" : {
                                        "$const" : 0
                                },
                                "max" : {
                                        "$max" : "$ChildrenLevel1.DateTime"
                                }
                        }
                }
        ],
        "ok" : 1

Note that if I change my $match condition so that 10 times more documents go through the pipeline, the first query still takes virtually 0ms (once the data is loaded in RAM), while the second query (the one with the $unwind) now takes 112s. We can therefore clearly see that unwinding the array has a huge impact. These tests were done on my dev machine. I also tried a dedicated machine that has plenty of RAM (30 GB were still unused after my tests); with that machine, the first query is still very fast, but the second query took a minimum of 36s, which is still not acceptable.

Note that I can provide my test database if it can help to go forward with this investigation.
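
As an aside, on server versions 3.2 and newer the $max aggregation expression accepts an array directly, so the $unwind can sometimes be avoided entirely. A minimal sketch against the same collection, assuming "ChildrenLevel1.DateTime" resolves to an array of dates per document:

db.Demo.aggregate([
	 {$match : {_id : {$lte : 1000}}},
	 // The inner $max is the 3.2+ expression form: it takes the per-document
	 // maximum of the "$ChildrenLevel1.DateTime" array. The outer $max is the
	 // accumulator form, taking the maximum across all matched documents.
	 {$group: {
	           _id : 0,
	           max : {$max : {$max : "$ChildrenLevel1.DateTime"}}
	          }
	 }
	])

This avoids materializing one intermediate document per array element, which is presumably where most of the $unwind cost goes.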

Comment by Asya Kamsky [ 26/Aug/14 ]

You don't need to $project unless you need to transform some fields (using one of the many operators $project supports), so you should reference fields inside arrays with the same [dot notation|http://docs.mongodb.org/manual/core/document/#dot-notation] you use during find. For your example:

db.coll.aggregate({$unwind:"$some_array_of_objects"},{$group:{_id:{op:"$some_object.operation", stat:"$some_array_of_objects.status"}, count:{$sum:1}}})
{ "_id" : { "op" : "x", "stat" : "fail" }, "count" : 1 }
{ "_id" : { "op" : "x", "stat" : "pass" }, "count" : 1 }

Edit:

Note the explain output, which shows which fields were kept by the aggregation pipeline:

db.project.aggregate([{$unwind:"$some_array_of_objects"},{$group:{_id:{op:"$some_object.operation", stat:"$some_array_of_objects.status"}, count:{$sum:1}}}],{explain:true})
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
 
				},
				"fields" : {
					"some_array_of_objects" : 1,
					"some_object.operation" : 1,
					"_id" : 0
				},
				"plan" : {
					"cursor" : "BasicCursor",
					"isMultiKey" : false,
					"scanAndOrder" : false,
					"allPlans" : [
						{
							"cursor" : "BasicCursor",
							"isMultiKey" : false,
							"scanAndOrder" : false
						}
					]
				}
			}
		},
		{
			"$unwind" : "$some_array_of_objects"
		},
		{
			"$group" : {
				"_id" : {
					"op" : "$some_object.operation",
					"stat" : "$some_array_of_objects.status"
				},
				"count" : {
					"$sum" : {
						"$const" : 1
					}
				}
			}
		}
	],
	"ok" : 1
}

Comment by Maxime Beaudry [ 26/Aug/14 ]

Hi Asya,

I will try to show in the simplest way what I am trying to do and how I am doing it. Hopefully you will be able to show me a better way to do it that is faster. Let's assume that all my documents in a collection look like the following:

  {
    "some_object": {
        "operation": "x",
        "field1": "1"
    },
    "some_array_of_objects": [
        {
            "status": "pass",
            "field2": "a-2",
            "field3": "a-3"
        },
        {
            "status": "fail",
            "field2": "b-2",
            "field3": "b-3"
        }
    ]
  }

As you can see, each document contains sub-documents and arrays of documents. What I want to do is count the number of 'status' values based on the 'operation'. So here is the query that I use to do it:

  db.OperationRuns.aggregate(
  [
    {$project : {
	   _id : 0,                               // I don't need this in my aggregation so I remove it as soon as possible.
	   operation : "$some_object.operation",  // Since I want to aggregate based on operation, I need to project it.
	   some_array_of_objects : 1              // Since I want to $unwind later, I need to project it
	  }
	},
	
	// Since I want the aggregation to be performed on properties of a sub-array, I need to unwind it.
	{$unwind : "$some_array_of_objects"},
	
	// I now project only what I need for the group operation. This step may be useless 
	// and I think that I can remove it.
	{$project: {
	    operation: 1,
		status: "$some_array_of_objects.status"
	  }	
	},
	
	// I now do my real aggregation.
	{$group : {
	    _id : {operation : "$operation", status: "$status"},
		sum : {$sum: 1}
	  }
	}
  ]
  )

Note that in real life I would use a $match operator at the beginning of the pipeline to minimize the number of documents that go through it. The $match condition would use an index for fast results.
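
For example, a minimal sketch of that pattern (the index and the match value here are hypothetical, chosen only for illustration):

  // Hypothetical supporting index for the $match below (2.6-era shell syntax).
  db.OperationRuns.ensureIndex({"some_object.operation" : 1})

  db.OperationRuns.aggregate([
	// $match first, so the index can narrow the set of documents
	// that enters the rest of the pipeline.
	{$match : {"some_object.operation" : "x"}},
	{$unwind : "$some_array_of_objects"},
	{$group : {
	    _id : {operation : "$some_object.operation", status : "$some_array_of_objects.status"},
	    sum : {$sum : 1}
	  }
	}
  ])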

Given that in the end I want to aggregate on something that is within an array, I have the feeling that I need to $project the field that I want to $unwind. So it is not really that I want to project unused fields. It is more that I want to project arrays that contain unneeded fields.

If there is another way to write this query that does not project anything useless, please let me know. I would be really glad to hear about it.

Comment by Asya Kamsky [ 26/Aug/14 ]

Maxime,

Why are you projecting fields that you don't need later in the pipeline? Correct me if I misread something in the duplicate ticket and that's not what's going on.

Asya

Comment by Maxime Beaudry [ 25/Aug/14 ]

Hi Mathias,

I am the one who created the duplicate issue SERVER-14159. From the investigations that I made, I see that performance degrades when we project either unused fields or large sub-documents. I assume that this is because the aggregation framework copies the data of the original document into a brand new document that is returned by the $project operation. Is that true? Is the pipeline really copying the projected fields into the resulting document?

I would love to contribute to this issue, either by brainstorming or by coding. This issue is really blocking our adoption of Mongo as a replacement for SQL Server... which is impacting our pricing model... which affects our ability to sell our product at a reasonable price! So let me know the best way to discuss this issue.

Thanks,
Maxime

Comment by Mathias Stearn [ 24/Jul/14 ]

To clarify, we already remove unused fields from the input documents. This would be removing unused fields from the intermediate stages that the user explicitly added. That can then have the knock-on effect of reducing the fields needed from the input documents. Currently we "trust" that all fields the user explicitly adds or requests are actually needed.
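
A minimal sketch of the kind of back-to-front dependency walk this implies (illustrative pseudocode only, not the server's actual optimizer; fieldsUsedByStage is a hypothetical stand-in for per-stage dependency extraction):

function isInclusionOnly(spec) {
	// True when every value is 1/true: a pure field whitelist with no computed fields.
	return Object.keys(spec).every(function (f) {
		return spec[f] === 1 || spec[f] === true;
	});
}

function pruneUnusedProjects(pipeline, fieldsUsedByStage) {
	var needed = {};    // fields consumed by the stages already seen
	var result = [];
	for (var i = pipeline.length - 1; i >= 0; i--) {
		var stage = pipeline[i];
		// An inclusion-only $project can be dropped outright when no later
		// stage reads any field at all (e.g. a $group on a constant _id).
		if (stage.hasOwnProperty("$project") && isInclusionOnly(stage["$project"]) &&
				Object.keys(needed).length === 0) {
			continue;
		}
		fieldsUsedByStage(stage).forEach(function (f) { needed[f] = true; });
		result.unshift(stage);
	}
	return result;
}

Applied to the pipeline in the description, the trailing $group reads no input fields, so the dependency set is empty when the walk reaches {$project:{username:1}} and that stage is discarded, yielding the same $noFieldsNeeded cursor as the second explain.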

Comment by Thomas Rueckstiess [ 24/Jul/14 ]

Feature request is:

Inspect the aggregation pipeline up front and optimize for unused fields.

Generated at Thu Feb 08 03:32:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.