[SERVER-45326] Poor Aggregation Framework performance in relational queries Created: 31/Dec/19 Updated: 19/Feb/20 Resolved: 19/Feb/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 4.0.9 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | James Daab | Assignee: | Eric Sedor |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Description |
|
We've been working on modeling bidirectional relations in MongoDB. We use the common approach of storing ids of related documents in specific fields, e.g.:
We've created a GraphQL server that uses the Aggregation Framework to perform lookups and resolve e.g. such a query:
to such a response:
Our reasoning was that using a single aggregation pipeline even for complex, nested queries will easily outperform sequential finds on multiple collections with all their associated round-trips and subsequent result manipulation. Using the example above, if we wanted to fetch some users along with all their items without leveraging the Aggregation Framework, we would have to first query the "users" collection and get the list of associated item ids, then query the "items" collection using those ids and finally connect users and items according to their relationships. The Aggregation Framework allows us to define all those tasks in a single pipeline and delegate their completion to MongoDB. The only problem is that it's much slower than the find-based approach and the performance gap only grows as we add more complexity. Here's a link to a simple benchmark, executed in the MongoDB, console that illustrates the problem: https://gist.github.com/alethes/369faf2e1c5a83f4fbcaca9001dbd501 Sample results on my system are:
The performance characteristics are almost identical when we use Node.js or C++ drivers. What could cause such discrepancies? Shouldn't a single, internal aggregation outperform a sequence of 3 find requests along with some JS data manipulation? Are there any issues in our pipeline that lead to slower execution?
|
| Comments |
| Comment by Eric Sedor [ 19/Feb/20 ] | |||||||||||||||||||||||
|
Thanks for your patience james@togedo.com, It looks to me like there is room for improvement as well. After examination I'm going to close this as a duplicate of SERVER-41171, which tracks performance differences between expressive and traditional $lookup. Please watch that ticket for updates! Gratefully, | |||||||||||||||||||||||
| Comment by James Daab [ 03/Jan/20 ] | |||||||||||||||||||||||
|
Thank you for that suggestion. The localField/foreignField variant is definitely faster. I've added it to the Gist: https://gist.github.com/alethes/369faf2e1c5a83f4fbcaca9001dbd501. In this case, it's about two times more performant than the previous best aggregation, but still about two times slower than the find-based version. This is surprising to me as the latter involves three separate requests along with all the associated JSON parsing and serialization as well as additional JS manipulation of flat query responses. Of course, the execution, in this case, would be slower if I didn't use a local MongoDB instance. Regardless, it seems the aggregation variant could be significantly faster due to a single processing pipeline, direct database access and the efficiency of C++. I've tried to narrow it down further and it looks like an aggregation with just the $match is even faster than an equivalent find. A $match with a single localField/foreignField $lookup is already about two times slower than the corresponding find+JS version. The $unwind and $project operations don't seem to incur any significant costs. The test cases are included in the Gist. I'm not sure how the localField/foreignField lookups work internally, but based on those comparisons I suspect there's some room for improvement. I believe MongoDB would hugely benefit from having a way to perform queries across collections that could outperform multiple requests and client-side data manipulation. P.S. Best wishes for the New Year! | |||||||||||||||||||||||
| Comment by Eric Sedor [ 31/Dec/19 ] | |||||||||||||||||||||||
|
I aplogize james@togedo.com, You are right I was mis-using the variable reference above; my suggestion is not a workaround and I'll edit that comment to correct it. There are two closely related issues open around $expr+$in (SERVER-37470, SERVER-32549). You can watch those tickets for updates. I am currently investigating which of these issues is the appropriate duplicate ticket for this case. One workaround proposed in the comments of those tickets is to use the localField/foreignField form of $lookup, which does reach into arrays and can use indexes. For example, I was able to work around using the following form:
| |||||||||||||||||||||||
| Comment by James Daab [ 31/Dec/19 ] | |||||||||||||||||||||||
|
Hi @Eric Sedor, thank you for your answer. The use of $expr is certainly problematic, but it seems there's no way to query the collection with $match while having access to aggregation expressions. I'm not sure if you meant it literally, but _$match: { _id: { $in: ["$items"] }}} doesn't work in the same way as the _$expr. Doesn't it simply filter documents with _id equal to the literal_"$items"_ string? The documentation states that raw aggregation expressions aren't available in the $match pipeline stage. Perhaps the main problem here is the lack of support for efficient lookups with variables. Such a feature seems pretty important for querying related data across multiple collections. The $match-based lookups are quite efficient, but they require constant, predefined filters. Is there any way to achieve similar performance when variables are involved? If not, is there any chance of this changing in the foreseeable future? | |||||||||||||||||||||||
| Comment by Eric Sedor [ 31/Dec/19 ] | |||||||||||||||||||||||
|
While it isn't necessarily true that a single operation will be faster, the major reason for the difference here looks to be the use of the $expr operator, which That said, the best way to diagnose a complex problem involving the comparison of performance of various client strategies and query operators is our community at mongodb-user group or on Stack Overflow with the mongodb tag. Starting there can help narrow down a larger problem to a specific bug or feature suggestion for the MongoDB server. Does this help? Eric Edit: Correcting P1 to be more specific and avoid implying a bad workaround |