[SERVER-21284] $lookup should cache query results Created: 04/Nov/15  Updated: 02/Mar/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.2.0-rc1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Benjamin Appréderisse Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 11
Labels: optimization, performance, query-product-scope-1, query-product-urgency-2, query-product-value-2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-64021 Lookup stage poor performance Closed
Related
is related to SERVER-21312 $lookup should batch query requests Backlog
is related to SERVER-49461 Slow performance when doing lookup in... Closed
Assigned Teams:
Query Optimization
Participants:

 Description   

If you are doing an aggregation on a large collection, and your aggregation contains a $lookup stage which queries a smaller collection (think <= 100 documents), the $lookup stage will issue a query for each document in the larger collection.

The $lookup stage should cache the results of the query to avoid unnecessary work in situations like this.

Original Description

I have tried the $lookup aggregator and the performance was not as expected.
I have one collection with about 350K documents and one with 100 documents.
There are two indexes on the “localfield” and on the “foreignfield”.
I am trying to join the first collection with the second.

After an exchange of emails with Norberto : "it looks like it is not using either indexes, and therefore it's taking so long. "

Please find above the link to test:
https://github.com/bappr/lookup-test



 Comments   
Comment by Xiaoqiang Zhou [ 18/Aug/22 ]

Any progress on this?

Comment by Labros Papadopoulos [ 07/Feb/22 ]

Are there any updates related to this issue? Our company have ran into this issue. Even though we have been using the embed documents approach, and the database works blazing fast when executing the standard CRUD operations, we can't avoid some of the lookup "join" operations that are required for the generation of certain reports. Even when using indexing to support those join operations, when joining tables that have 100K+ documents the operations take more than 30 secs! It's a petty that this issue hasn't yet been resolved, because aside of that, the MongoDb has done a stellar job!

Comment by Charlie Swanson [ 05/Nov/15 ]

Ok great! I have filed SERVER-21312 if you want to track that ticket as well.

Comment by Norberto Fernando Rocha Leite (Inactive) [ 04/Nov/15 ]

Hi Benjamin,

He have done a bit more of drilling into the issue you raised and as charlie.swanson is mentioning the system is actually using the indexes accordingly.
The problem will end up with the cardinality of the overall operation that will need to run a query 350K times on the from collection the data to accomplish the $lookup.

N.

Comment by Benjamin Appréderisse [ 04/Nov/15 ]

Hi Charlie,

Thanks for giving me these explanations.

We can use this way.

Benjamin

Comment by Charlie Swanson [ 04/Nov/15 ]

Hi bappr,

I have been in communication with Norberto, and it looks like the $lookup stage is indeed using an index on the foreign collection. It is not using one on the local collection, but that is expected, and probably the fastest way to proceed, as it is looking at all documents in the collection.

So my impression is that $lookup is behaving exactly as expected. It is probably slow because it has to do 350K queries. If this is significantly slower than doing those queries on their own, that would indeed be unexpected, and a likely bug.

That said, there are many potential optimizations that we are not yet taking advantage of for the $lookup stage. For instance, we could maintain a cache of the results of the queries, or batch the lookups to do fewer queries. These are things we have thought about doing, but do not have tickets in JIRA yet.

If this is more what you meant by "performance is not as expected", we can morph this ticket into the request to cache the results, and open another ticket for batching the lookups. If that sounds like a good way to proceed, let me know and I'll go ahead with it.

Generated at Thu Feb 08 03:56:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.