Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: 4.0.5
Component/s: Aggregation Framework, Index Maintenance, Performance
Hello,
I'm running into the 16MB document size hard limit on an aggregation using a `$lookup`.
I have two collections, `tag` and `video`.
A `tag` looks like this:
{ "_id" : ObjectId("5c81217506b84b3e13cc106c"), "..." "created_at" : ISODate("2015-03-02T09:51:26.315Z"), "updated_at" : ISODate("2015-03-02T09:51:26.315Z"), }
A `video` looks like this:
{ "_id" : ObjectId("5c81223606b84b3e13cca867"), {...}, "tags" : [ { "_id" : ObjectId("5c81217606b84b3e13cc27e5"), {...} }, { "_id" : ObjectId("5c81217506b84b3e13cc148c"), "publish_id" : "4f10596a-1a7c-4a18-b81d-0c33c4887425", "name" : "jenifer the voice", "label" : "Jenifer The Voice", "slug" : "jenifer-the-voice" }, ] }
So a `video` can have multiple tags (embedded documents) and a `tag` can be used in multiple `video` documents.
In the `tag` collection, I need to know how many `video` documents are related to each tag, and (for now) I have to compute this every time I run a find from my interface (yep, that's not really cool).
So I added an index on the `video` collection for `tags._id` and wrote the following aggregation:
```js
db.getCollection('tag').aggregate([
    { "$match": { "_id": { "$in": [ ... ] } } },
    { "$lookup": {
        "from": "video",
        "localField": "_id",
        "foreignField": "tags._id",
        "as": "videoTags"
    } },
    { "$project": {
        "_id": 1,
        "name": "$name",
        "publish_id": "$publish_id",
        "label": "$label",
        "slug": "$slug",
        "created_at": "$created_at",
        "updated_at": "$updated_at",
        "sort": "$sort",
        "archived": "$archived",
        "parent_id": "$parent_id",
        "nbRelatedVideos": { "$size": "$videoTags" }
    } }
])
```
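For reference, the index mentioned above was created along these lines:

```js
// Multikey index on the embedded tags' _id field, so the plain
// localField/foreignField form of $lookup can use it.
db.getCollection('video').createIndex({ "tags._id": 1 })
```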
That aggregation works really well until a tag has too many related videos ...
It appears that the `$lookup` puts the entire matching `video` documents into the `videoTags` field ... so it can take up a lot of space. As I only need to count the videos, one field to count on (like `_id`) would be enough.
Maybe there is a way I can count that directly in the `$lookup`?
After some research, I found a solution that avoids the 16MB hard limit ... but it is clearly not fast, as `$expr` does not use multikey indexes:
```js
db.getCollection('tag').aggregate([
    { $match: { "_id": { $in: [ ... ] } } },
    { $lookup: {
        from: "video",
        as: "videoTags",
        let: { "tagId": "$_id" },
        pipeline: [
            { $unwind: "$tags" },
            { $match: { $expr: { $eq: ["$tags._id", "$$tagId"] } } },
            { $project: { "_id": true } }
        ]
        // "localField": "_id",
        // "foreignField": "tags._id",
    } },
    { "$project": {
        "_id": 1,
        "name": "$name",
        "publish_id": "$publish_id",
        "label": "$label",
        "slug": "$slug",
        "created_at": "$created_at",
        "updated_at": "$updated_at",
        "sort": "$sort",
        "archived": "$archived",
        "parent_id": "$parent_id",
        "nbRelatedVideos": { "$size": "$videoTags" }
    } }
])
```
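A variant I can think of keeps the `$lookup` output tiny by counting inside the sub-pipeline with `$count` instead of returning one document per match (the `videoCount` and `n` names are just placeholders I made up); as far as I can tell it still relies on `$expr`, so it hits the same index limitation:

```js
db.getCollection('tag').aggregate([
    { $match: { "_id": { $in: [ /* tag ids */ ] } } },
    { $lookup: {
        from: "video",
        as: "videoCount",
        let: { "tagId": "$_id" },
        pipeline: [
            // $in (aggregation expression) checks membership of $$tagId in the
            // array of embedded tag _ids, so no $unwind is needed. This assumes
            // every video document has a "tags" array.
            { $match: { $expr: { $in: ["$$tagId", "$tags._id"] } } },
            // Collapse all matches into a single { n: <count> } document,
            // so the lookup output stays far below the 16MB limit.
            { $count: "n" }
        ]
    } },
    { $project: {
        "_id": 1,
        // ... other projected fields as above ...
        // $count emits no document when there are no matches, hence $ifNull.
        "nbRelatedVideos": {
            $ifNull: [ { $arrayElemAt: ["$videoCount.n", 0] }, 0 ]
        }
    } }
])
```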
Is there any workaround to improve performance here?
Maybe there is a way I can do the count directly in the `$lookup`?
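The only index-friendly alternative I can think of is to flip the query around and count from the `video` side, where a plain `$match` on `tags._id` can use the multikey index (this is only a sketch, I have not benchmarked it):

```js
// Count, per tag, how many videos reference it; the first $match
// can use the multikey index on tags._id.
db.getCollection('video').aggregate([
    { $match: { "tags._id": { $in: [ /* tag ids */ ] } } },
    { $unwind: "$tags" },
    // Re-filter after $unwind: the first $match keeps whole videos,
    // including embedded tags that are not in the requested list.
    { $match: { "tags._id": { $in: [ /* tag ids */ ] } } },
    { $group: { _id: "$tags._id", nbRelatedVideos: { $sum: 1 } } }
])
```

The per-tag counts would then have to be merged back into the `tag` documents on the application side.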
I hope the question is clear enough; do not hesitate to ask for more information if needed.