[SERVER-40235] Is there a workaround to improve perfs on $lookup with $expr ? Created: 20/Mar/19  Updated: 26/Mar/19  Resolved: 26/Mar/19

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Index Maintenance, Performance
Affects Version/s: 4.0.5
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Romain Masclef Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hello,

I'm hitting the 16MB document size hard limit on an aggregation that uses a `$lookup`.

I have two collections: `tag` and `video`

A `tag` looks like this:

{
    "_id" : ObjectId("5c81217506b84b3e13cc106c"),
    "..."
    "created_at" : ISODate("2015-03-02T09:51:26.315Z"),
    "updated_at" : ISODate("2015-03-02T09:51:26.315Z"),
}

A `video` looks like this:

{
    "_id" : ObjectId("5c81223606b84b3e13cca867"),
    {...},
    "tags" : [ 
        {
            "_id" : ObjectId("5c81217606b84b3e13cc27e5"),
            {...}
        },
        {
            "_id" : ObjectId("5c81217506b84b3e13cc148c"),
            "publish_id" : "4f10596a-1a7c-4a18-b81d-0c33c4887425",
            "name" : "jenifer the voice",
            "label" : "Jenifer The Voice",
            "slug" : "jenifer-the-voice"
        }, 
    ]
}

So a `video` can have multiple tags (embedded documents) and a `tag` can be used in multiple `videos`.

In the `tag` collection, I need to know how many `video` documents are related to each tag, and (for now) I'm forced to compute this every time I do a find from my interface (yep, that's not really cool).

So I added an index on the `video` collection for `tags._id` and wrote the following aggregation:

db.getCollection('tag').aggregate([
  {"$match":  {"_id":  {"$in": [ ... ]}}},
  {"$lookup": { "from": "video", "localField": "_id", "foreignField": "tags._id", "as": "videoTags"}},
  {"$project": {
    "_id": 1,
    "name": "$name",
    "publish_id": "$publish_id",
    "label": "$label",
    "slug": "$slug",
    "created_at": "$created_at",
    "updated_at": "$updated_at",
    "sort": "$sort",
    "archived": "$archived",
    "parent_id": "$parent_id",
    "nbRelatedVideos": {"$size": "$videoTags"}
  }}
])

 
That one works really well until a tag has too many related videos ...

It appears that the lookup puts the entire matching `video` documents into the `videoTags` field ... so it can take a lot of space. As I only need to count the videos, a single field (like `_id`) would be enough.

Maybe there is a way I can do that count directly in the `$lookup`?
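For reference, here is what counting inside the sub-pipeline could look like (a sketch on my side, assuming MongoDB 3.6+ for the pipeline form of `$lookup`; the output field name `n` is an arbitrary choice):

```javascript
// Sketch: do the counting inside the $lookup sub-pipeline, so each tag
// only carries back a tiny {n: <count>} document instead of full videos.
// The aggregation $in operator tests membership of $$tagId in the array
// of embedded tag ids, so no $unwind of video.tags is needed.
const lookupCountPipeline = [
  { $match: { "_id": { $in: [ /* tag ids */ ] } } },
  { $lookup: {
      from: "video",
      as: "videoTags",
      let: { tagId: "$_id" },
      pipeline: [
        { $match: { $expr: { $in: ["$$tagId", "$tags._id"] } } },
        { $count: "n" }   // collapses all matches into a single {n: N} doc
      ]
  } },
  { $project: {
      "_id": 1,
      // ... same tag fields as in the $project above ...
      "nbRelatedVideos": { $ifNull: [{ $arrayElemAt: ["$videoTags.n", 0] }, 0] }
  } }
];
```

This still won't use the multikey index (the `$expr` match is evaluated per document), but it keeps every intermediate document far below 16MB.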

After some research, I found a solution that avoids the 16MB hard limit ... but it is clearly slow, as `$expr` does not use multikey indexes:

 

db.getCollection('tag').aggregate([
  {$match:  {"_id":  {$in: [ ... ]}}},
  {$lookup: {
      from: "video",
      as: "videoTags",
      let: { "tagId": "$_id" },
      pipeline: [
        {$unwind: "$tags"},
        {$match: {
            $expr: {
                $eq: ["$tags._id", "$$tagId"]
            }
        }},
        {$project: {"_id": true}}
      ],
//      "localField": "_id",
//      "foreignField": "tags._id", 
  }},
  {"$project": {
    "_id": 1,
    "name": "$name",
    "publish_id": "$publish_id",
    "label": "$label",
    "slug": "$slug",
    "created_at": "$created_at",
    "updated_at": "$updated_at",
    "sort": "$sort",
    "archived": "$archived",
    "parent_id": "$parent_id",
    "nbRelatedVideos": {"$size": "$videoTags"}
  }},
])

 

Is there any workaround to improve performance here?
Maybe there is a way to do the count directly in the `$lookup`?
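For completeness, here is the inverted approach I would also try (an untested sketch: it trades the `$lookup` for a `$group` on the `video` collection, where the first `$match` is a plain query that can use the multikey index on `tags._id`):

```javascript
// Sketch: flip the aggregation around. A plain (non-$expr) $match on
// video.tags._id is eligible for the multikey index, and $group keeps
// only a per-tag counter, so the 16MB limit never comes into play.
const tagIds = [ /* same tag ids as the $in above */ ];
const videoCountPipeline = [
  { $match: { "tags._id": { $in: tagIds } } },   // index-eligible match
  { $unwind: "$tags" },
  { $match: { "tags._id": { $in: tagIds } } },   // re-filter after $unwind
  { $group: { _id: "$tags._id", nbRelatedVideos: { $sum: 1 } } }
];
```

The tag metadata (name, slug, ...) then has to be merged back application-side, or via a second cheap query on `tag`.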

I hope the question is clear enough, do not hesitate to ask for more information if needed.



 Comments   
Comment by Romain Masclef [ 26/Mar/19 ]

 

> The parameter is set on the server-side so you would not be able to enable it via any driver.

 

Indeed I asked a silly question here ^^

Thank you for your help! I'll try asking the mongo community whether there is any way to optimize this query.

Comment by Danny Hatcher (Inactive) [ 26/Mar/19 ]

Hello Romain,

> BTW, is there a way to do this using the GO_DRIVER ?

The parameter is set on the server-side so you would not be able to enable it via any driver.

> I did not find any doc on this feature, should I open a ticket ?

You are correct; I have opened DOCS-12573 to document this feature.

For the rest of your questions, as they regard to query optimization and not a bug within MongoDB, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag. The people in those forums have a lot of experience with writing complex queries for MongoDB and may be able to help.

Thank you,

Danny

Comment by Romain Masclef [ 26/Mar/19 ]

I just tested the solution and it seems to work well, though I still think there must be a better way to do it ^^

Comment by Romain Masclef [ 26/Mar/19 ]

Hi @DannyHatcher, thanks a lot for your answer!!
I'll try this right now.

BTW, is there a way to do this using the GO_DRIVER ?

I did not find any doc on this feature; should I open a ticket? (it's not on the `https://docs.mongodb.com/manual/reference/limits/#BSON-Document-Size` or `https://docs.mongodb.com/manual/reference/configuration-options/#setparameter-option` pages)

I've also seen some threads about using `$unwind` right after the `$lookup`, but I can't find examples of that either.

Plus I really do think there is a way to optimize my query and get the count in the `$lookup` output, but I can't figure out how ...

Comment by Danny Hatcher (Inactive) [ 21/Mar/19 ]

Hello Romain,

SERVER-31755 was implemented in 4.0.6 to allow the $lookup stage to raise the document size during its operation. If you upgrade your binaries to 4.0.6, can you use that feature to get around your issue?
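For reference, the knob is a server parameter; my reading of SERVER-31755 is that it is named `internalLookupStageIntermediateDocumentMaxSizeBytes` (DOCS-12573 will be the authoritative write-up), and it can be set at startup or at runtime:

```
// At runtime, from the mongo shell (value in bytes; 100MB here is arbitrary):
db.adminCommand({
    setParameter: 1,
    internalLookupStageIntermediateDocumentMaxSizeBytes: 104857600
})

// Or at startup:
// mongod --setParameter internalLookupStageIntermediateDocumentMaxSizeBytes=104857600
```

Note this only raises the limit on the intermediate document built by `$lookup`; the 16MB limit on the final output documents still applies.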

Thank you,

Danny

Generated at Thu Feb 08 04:54:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.