[SERVER-13703] Presence of extraneous $project causes a less efficient plan to be used Created: 23/Apr/14 Updated: 13/Nov/23
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 2.6.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Asya Kamsky | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | optimization, query-44-grooming |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Query Optimization |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Query 2017-03-27, Query 2017-04-17, QO 2022-10-03 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Even though the user explicitly tells us to include specific fields, we can see that they are not actually used later in the pipeline, so doesn't it make sense to optimize the $project away, the same as if it weren't there? Without the (needless) $project, a more efficient plan is chosen.
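(The original example pipelines and explain output did not survive this export. A minimal sketch of the pattern being described, with hypothetical collection and field names:)

```javascript
// Hypothetical names throughout; not the ticket's original example.
// The $project below is extraneous: 'total' is never referenced by any
// later stage, yet its presence can keep the optimizer from noticing
// that only 'status' needs to be fetched.
db.orders.aggregate([
  { $match: { status: "A" } },
  { $project: { status: 1, total: 1 } },
  { $group: { _id: "$status", n: { $sum: 1 } } }
])

// Without the (needless) $project, dependency analysis can narrow the
// required fields to 'status' alone and pick a more efficient plan:
db.orders.aggregate([
  { $match: { status: "A" } },
  { $group: { _id: "$status", n: { $sum: 1 } } }
])
```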
| Comments |
| Comment by David Storch [ 05/Oct/22 ] |

I'm re-opening this ticket. The fix was implemented in the new query optimizer that is currently under development by the Query Optimization team. We expect this issue to be fixed as part of the release of the new optimizer. However, the optimizer is not currently scheduled for release in 6.2 (its development is happening behind a feature flag), so we should not have this ticket marked as fixed with a 6.2 fixVersion. We are currently discussing as a team how best to represent this situation in Jira.
| Comment by Githook User [ 03/Oct/22 ] |

Author: {'name': 'Naama Bareket', 'email': 'naama.bareket@mongodb.com', 'username': 'naama-bareket'}

Message: SERVER-13703: Presence of extraneous $project causes a less efficient plan to be used
| Comment by Maxime Beaudry [ 29/Aug/14 ] |

I agree. I have opened SERVER-15099. Still, thanks a lot for your help!
| Comment by Asya Kamsky [ 29/Aug/14 ] |

Maxime - your aggregation is slow because you are unwinding a large array. I don't believe there is any relationship to this server ticket. You might want to open a separate enhancement request for optimization of large $unwind operations. I'm not sure how much can be done about it, but it wouldn't be part of this ticket.
| Comment by Maxime Beaudry [ 27/Aug/14 ] |

I just created a backup of my database by running mongodump.exe on my bigger machine (the one with lots of memory). Here is the output of the command:

As you can see, reading the complete database and writing it to disk takes about 8 seconds. Considering that my aggregation query that used the $unwind took 36s, it is hard to understand why writing to disk (in my opinion one of the slowest things a computer can do) can be 4.5x faster than running a query on data that is already loaded in memory.

Considering that my collection has 10000 documents that each have 10 elements within 'ChildrenLevel1', each top-level document takes 36s / 10000 = 3.6ms to process. In my opinion, this is very long given that the data is already loaded in RAM. Furthermore, since we saw while doing the backup that the disks are quite fast, it means that the CPU is very busy doing something. We are CPU bound, not IO bound.

Please don't take these stats or opinions as criticism. I am just trying to lay out the facts so that we can find a solution. Once again, I really want to find a solution to this issue. It is key in our strategy to kick SQL Server out.

Now back to the subject of the backup. I am compressing it now and it will be about 1 GB. If you prefer, I can provide a very small command-line executable (Windows) that can be used to create the database in a few minutes.
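(The mongodump output above was elided in this export and is not reconstructed here. For reference, a representative invocation, with a hypothetical database name and output path:)

```
mongodump.exe --db mydb --out C:\backup
```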
| Comment by Maxime Beaudry [ 26/Aug/14 ] |

I tried without the $project on my "real" database, which has a significant amount of data. The db is 5 GB, which is still quite small in my opinion. The first step was to try to get the max value of a field in the top-level document:
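(The query itself was elided in the export; a hypothetical reconstruction with placeholder field names:)

```javascript
// Placeholder names; the actual query did not survive the export.
// Max of a field that lives on the top-level document.
db.mycollection.aggregate([
  { $match: { Timestamp: { $gte: 100 } } },  // selective, indexed condition
  { $group: { _id: null, max: { $max: "$TopLevelValue" } } }
])
```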
This takes virtually 0ms. It is very fast. I don't even see an entry in the Mongo log file for this query. I now go a step further and try to find the max value of a field within an array:
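(Again hypothetical; the array field name 'ChildrenLevel1' comes from the comments earlier in this thread, and 'Value' is a placeholder:)

```javascript
// Same shape of query, but the value now lives inside an array, so the
// array must be unwound before it can be aggregated.
db.mycollection.aggregate([
  { $match: { Timestamp: { $gte: 100 } } },
  { $unwind: "$ChildrenLevel1" },
  { $group: { _id: null, max: { $max: "$ChildrenLevel1.Value" } } }
])
```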
This takes 3.2 seconds. There is therefore a huge impact in unwinding the array, even if I don't use the $project. FYI, here is the output of 'explain':
Note that if I change my $match condition so that 10 times more documents go through the pipeline, the first query still takes virtually 0ms (once the data is loaded in RAM). The second query (the one with the $unwind) now takes 112s. We can therefore clearly see that there is a huge impact in unwinding the array.

These tests were done on my dev machine. I also tried with a dedicated machine that has plenty of RAM (30 GB were still unused after my tests). With this machine, the first query is still very fast and the second query took a minimum of 36s, which is still not acceptable. Note that I can provide my test database if it helps to move this investigation forward.
| Comment by Asya Kamsky [ 26/Aug/14 ] |

You don't need to $project unless you need to transform some fields (using one of the many operators $project supports), so you should reference fields inside arrays with [dot notation](http://docs.mongodb.org/manual/core/document/#dot-notation), the same way you do during find. For your example:
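(Her original example was elided; a sketch of the suggestion applied to the pipeline Maxime describes below. Field names 'ChildrenLevel1', 'operation', and 'status' follow his description; where exactly they live in his documents is an assumption.)

```javascript
// No $project at all: the array elements are addressed with dot
// notation after $unwind, and dependency analysis keeps only the
// fields the pipeline actually needs.
db.mycollection.aggregate([
  { $unwind: "$ChildrenLevel1" },
  { $group: {
      _id: { operation: "$ChildrenLevel1.operation",
             status: "$ChildrenLevel1.status" },
      count: { $sum: 1 }
  } }
])
```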
Edit: note the explain output, which shows which fields were kept by the agg pipeline:
| Comment by Maxime Beaudry [ 26/Aug/14 ] |

Hi Asya, I will try to show in the simplest way what I am trying to do and how I am doing it. Hopefully you will be able to show me a better, faster way to do it. Let's assume that all the documents in my collection look like the following:
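(The sample document was lost in the export; a hypothetical shape consistent with the fields named in this thread:)

```javascript
// Hypothetical document shape; the original sample did not survive.
// A sub-document ('Info') plus an array of documents ('ChildrenLevel1').
db.mycollection.insert({
  Info: { CreatedOn: new Date(), Source: "test" },
  ChildrenLevel1: [
    { operation: "insert", status: "OK" },
    { operation: "update", status: "Failed" }
  ]
})
```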
As you can see, each document contains sub-documents and arrays of documents. What I want to do is count the number of 'status' values based on the 'operation'. So here is the query that I use to do it:
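(Also a hypothetical reconstruction; the point is that the $project exists only to feed the $unwind, which is exactly the "extraneous" stage this ticket is about.)

```javascript
// The $project keeps only the array so it can be unwound; every other
// field is discarded up front.
db.mycollection.aggregate([
  { $project: { ChildrenLevel1: 1 } },
  { $unwind: "$ChildrenLevel1" },
  { $group: {
      _id: { operation: "$ChildrenLevel1.operation",
             status: "$ChildrenLevel1.status" },
      count: { $sum: 1 }
  } }
])
```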
Note that in real life I would use a $match operator at the beginning of the pipeline to minimize the number of documents that go through it. The $match condition would use an index for fast results.

Given that in the end I want to aggregate on something that is within an array, I have the feeling that I need to $project the field that I want to $unwind. So it is not really that I want to project unused fields; it is more that I want to project arrays that contain useless fields. If there is another way to write this query that does not project anything useless, please let me know. I would be really glad to hear about it.
| Comment by Asya Kamsky [ 26/Aug/14 ] |

Maxime,

Why are you projecting fields that you don't need later in the pipeline? Correct me if I misread something in the duplicate ticket and that's not what's going on.

Asya
| Comment by Maxime Beaudry [ 25/Aug/14 ] |

Hi Mathias,

I am the one who created the duplicate issue. I would love to contribute to this issue, either by brainstorming or by coding. This issue is really blocking our adoption of Mongo as a replacement for SQL Server... which is impacting our pricing model... which affects our ability to sell our product at a reasonable price! So let me know of the best way to discuss this issue.

Thanks,
| Comment by Mathias Stearn [ 24/Jul/14 ] |

To clarify, we already remove unused fields from the input documents. This would be removing unused fields from the intermediate stages that the user explicitly added. This can then have the knock-on effect of reducing the fields needed from the input documents. Currently we "trust" that all fields the user explicitly adds or requests are actually needed.
| Comment by Thomas Rueckstiess [ 24/Jul/14 ] |

Feature request: inspect the aggregation pipeline up front and optimize away unused fields.