[SERVER-45326] Poor Aggregation Framework performance in relational queries Created: 31/Dec/19  Updated: 19/Feb/20  Resolved: 19/Feb/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.0.9
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: James Daab Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-41171 MongoDB $lookup using "$expr" is slow... Backlog
Related
is related to SERVER-32549 allow $expr with $in expression use a... Backlog
is related to SERVER-37470 Lookup sub-pipeline is not using inde... Backlog
is related to SERVER-34927 allow localField and foreignField wit... Closed
Participants:

 Description   

We've been working on modeling bidirectional relations in MongoDB. We use the common approach of storing ids of related documents in specific fields, e.g.:

"users"
{_id: "user1", items: ["item1"], name: "John"}
 
"items"
{_id: "item1", owner: "user1", type: "blanket"}

We've created a GraphQL server that uses the Aggregation Framework to perform lookups and resolve e.g. such a query:

 

{
  users {
    id
    name
    items {
      id
      type
    }
  }
}

to such a response:

 

 

{
  "users": {
    "id": "user1",
    "name": "John",
    "items": [
      {
        "id": "item1",
        "type": "blanket"
      }
    ]
  }
}

 

 Our reasoning was that using a single aggregation pipeline even for complex, nested queries will easily outperform sequential finds on multiple collections with all their associated round-trips and subsequent result manipulation.

Using the example above, if we wanted to fetch some users along with all their items without leveraging the Aggregation Framework, we would have to first query the "users" collection and get the list of associated item ids, then query the "items" collection using those ids and finally connect users and items according to their relationships.

The Aggregation Framework allows us to define all those tasks in a single pipeline and delegate their completion to MongoDB. The only problem is that it's much slower than the find-based approach and the performance gap only grows as we add more complexity. Here's a link to a simple benchmark, executed in the MongoDB, console that illustrates the problem:

https://gist.github.com/alethes/369faf2e1c5a83f4fbcaca9001dbd501

Sample results on my system are:

Aggregation 1: 387 ms
Aggregation 2: 103 ms
Find: 23 ms

The performance characteristics are almost identical when we use Node.js or C++ drivers.

What could cause such discrepancies? Shouldn't a single, internal aggregation outperform a sequence of 3 find requests along with some JS data manipulation? Are there any issues in our pipeline that lead to slower execution?

 



 Comments   
Comment by Eric Sedor [ 19/Feb/20 ]

Thanks for your patience james@togedo.com,

It looks to me like there is room for improvement as well. After examination I'm going to close this as a duplicate of SERVER-41171, which tracks performance differences between expressive and traditional $lookup. Please watch that ticket for updates!

Gratefully,
Eric

Comment by James Daab [ 03/Jan/20 ]

Thank you for that suggestion. The localField/foreignField variant is definitely faster. I've added it to the Gist: https://gist.github.com/alethes/369faf2e1c5a83f4fbcaca9001dbd501.

In this case, it's about two times more performant than the previous best aggregation, but still about two times slower than the find-based version. This is surprising to me as the latter involves three separate requests along with all the associated JSON parsing and serialization as well as additional JS manipulation of flat query responses. Of course, the execution, in this case, would be slower if I didn't use a local MongoDB instance.

Regardless, it seems the aggregation variant could be significantly faster due to a single processing pipeline, direct database access and the efficiency of C++. I've tried to narrow it down further and it looks like an aggregation with just the $match is even faster than an equivalent find. A $match with a single  localField/foreignField $lookup is already about two times slower than the corresponding find+JS version. The $unwind and $project operations don't seem to incur any significant costs. The test cases are included in the Gist.

I'm not sure how the localField/foreignField lookups work internally, but based on those comparisons I suspect there's some room for improvement. I believe MongoDB would hugely benefit from having a way to perform queries across collections that could outperform multiple requests and client-side data manipulation.

P.S. Best wishes for the New Year!

Comment by Eric Sedor [ 31/Dec/19 ]

I aplogize james@togedo.com,

You are right I was mis-using the variable reference above; my suggestion is not a workaround and I'll edit that comment to correct it.

There are two closely related issues open around $expr+$in (SERVER-37470, SERVER-32549). You can watch those tickets for updates. I am currently investigating which of these issues is the appropriate duplicate ticket for this case.

One workaround proposed in the comments of those tickets is to use the localField/foreignField form of $lookup, which does reach into arrays and can use indexes. For example, I was able to work around using the following form:

db.items.aggregate([
    {
      $match: { size: { $lt: 0.5 } }
    },
    {
      $lookup: {
        from: "users",
        localField: "_id",
        foreignField: "items",
        as: "owner"
      }
    },
    {$unwind:"$owner"},
    {
      $lookup: {
        from: "users",
        localField: "owner.partner",
        foreignField: "_id",
        as: "owner.partner"
      }
    },
    {$unwind:"$owner.partner"}
  ])

Comment by James Daab [ 31/Dec/19 ]

Hi @Eric Sedor,

thank you for your answer. The use of $expr  is certainly problematic, but it seems there's no way to query the collection with $match while having access to aggregation expressions. I'm not sure if you meant it literally, but _$match: { _id:

{ $in: ["$items"] }

}} doesn't work in the same way as the _$expr. Doesn't it simply filter documents with _id equal to the literal_"$items"_ string? The documentation states that raw aggregation expressions aren't available in the $match pipeline stage.

Perhaps the main problem here is the lack of support for efficient lookups with variables. Such a feature seems pretty important for querying related data across multiple collections. The $match-based lookups are quite efficient, but they require constant, predefined filters. Is there any way to achieve similar performance when variables are involved? If not, is there any chance of this changing in the foreseeable future?

Comment by Eric Sedor [ 31/Dec/19 ]

Hi james@togedo.com

While it isn't necessarily true that a single operation will be faster, the major reason for the difference here looks to be the use of the $expr operator, which is not always ... to each other. does not use indexes well due to SERVER-37470SERVER-32549 (see my next comment below).

That said, the best way to diagnose a complex problem involving the comparison of performance of various client strategies and query operators is our community at mongodb-user group or on Stack Overflow with the mongodb tag. Starting there can help narrow down a larger problem to a specific bug or feature suggestion for the MongoDB server.

Does this help?

Eric

Edit: Correcting P1 to be more specific and avoid implying a bad workaround

Generated at Thu Feb 08 05:08:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.