[SERVER-37470] Lookup sub-pipeline is not using index with the $in operator Created: 04/Oct/18  Updated: 27/Jun/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Marco Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 14
Labels: storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-45326 Poor Aggregation Framework performanc... Closed
is related to SERVER-32549 allow $expr with $in expression use a... Backlog
Assigned Teams:
Query Optimization
Participants:

 Description   

Description

Suppose we have the following documents in the emails collection:

[{_id: ObjectId(), subject: 'Hello there', to: ['address1@example.com', 'address2@example.com']}]

With the index:

 { { "v" : 2, "key" : { "to" : -1 }, "name" : "to_-1", "ns" : "test.emails" }

And the following documents in the collection users:

 [{_id: ObjectId(), name: 'John Doe', address: 'address1@example.com'}]

Then, if we use the following pipeline on 'users' in order to insert all emails sent to our users into 'sentEmails':

db.users.aggregate([
  {$lookup: {
    from: 'emails',
    let: {userAddress: '$address'},
    as: 'sentEmails',
    pipeline: [
      {$match: {$expr: {
        $in: ['$$userAdress', '$to']      
      }}}
    ]
  }}
])

Or, even if we extracted the emails sub-pipeline into an equivalent pipeline for our first user (John Doe):

db.emails.aggregate([
 {$match: {$expr: {
   $in: ['address1@example.com', '$to'] 
 }}}
])

 

Its not using the index on the emails collection.

 

However, if you do the same without using the $expr, the index is used properly:

db.emails.aggregate([{$match: {to: 'address1@example.com'}}], {explain: true}) 
...
 "winningPlan" : { "winningPlan" : { "stage" : "FETCH", "inputStage" : { "stage" : "IXSCAN",
...

Observed results

The index is not used when the $in operator is used inside an $expr on a $match

Expected results

The index should be used to perform the $match stage



 Comments   
Comment by BaLL Tse [ 27/Jun/23 ]

Hello asya.kamsky@mongodb.com 

a topic created in the from:

topic

Thanks.

Comment by Asya Kamsky [ 26/Jun/23 ]

ball.tse@kuehne-nagel.com it's difficult to answer without knowing exactly what your pipeline is.  Maybe ask specific question with sample documents and sample pipeline in the community forums? (https://www.mongodb.com/community/)

Comment by BaLL Tse [ 26/Jun/23 ]

Hello james.wahlin@mongodb.com,

    Are there any work around if there are multiple $in criteria?

Thanks.

Comment by Marco Catalan [ 06/Sep/21 ]

Hello James. Those are great news to hear, this feature also solves our original use case for this issue, thanks!

Comment by Joerg Baier [ 03/Sep/21 ]

Thanks, that solves our concerns 

Comment by James Wahlin [ 03/Sep/21 ]

I would like to advertise that in MongoDB 5.0 you can now use localField/foreignField syntax along with a pipeline specification (SERVER-34927). joerg.baier@anthemiq.com with this change you could now build the following $lookup stage, taking advantage of the localField/foreignField indexing behavior:

{  
   $lookup: {    
      from: "suites",
      localField: "suiteIds",
      foreignField: "_id",
      pipeline: [ 
         ... filter/project/count/etc ...
      ],
      as: "suites"
   }
}

Comment by Joerg Baier [ 02/Sep/21 ]

My team is experiencing the same issues as others have mentioned.

We use this style of lookup wherever possible, which uses the index on the "from" collection.

{
  $lookup: {
    from: "suites",
    localField: "suiteIds",
    foreignField: "_id",
    as: "suites"
  }
}

We have many use-cases for the sub-pipeline, however this does not use the index on the "from" collection.

{
  $lookup: {
    from: "suites",
    let: { ids: "$suiteIds" },
    pipeline: [
      { $match: { $expr: { $in: ["$_id", "$$ids"] } } },
      ... filter/project/count/etc ...
    ],
    as: "suites"
  }
}

 
I don't understand why this has not gotten more attention. I was very surprised by this behavior. There are many people who do not realize they are writing an inefficient aggregation, even though the query is equivalent in theory. This becomes incredibly slow as "suites" becomes a large collection.

 

Comment by Marco Catalan [ 25/Nov/19 ]

If I'm understanding correctly, SERVER-32549 is tracking something very similar, but only in the case when the value  you are using the $in is not an array. This issue is for when value is an array. I don't know if internally they require different fixes or not, but if you consider SERVER-32549 would fix this behavior you can close as a duplicate, just please update SERVER-32549 to indicate the index should also be used when the field is an array

Comment by Asya Kamsky [ 24/Nov/19 ]

Ok but that’s just duplicate of SERVER-32549 - is there any way that this ticket is unique and shouldn’t be closed as a duplicate?

Comment by Marco Catalan [ 21/Nov/19 ]

Hello Asya,

 

The lookup we are trying to do returns many elements, and we just want to get the number of elements returned, not the whole document. That is only possible using the expressive $lookup and doing a $count as a last stage, since if you try to use the classic $lookup, you will exceed mongodb document size.

Another use case would be to get small projections of the looked up data, specially when dealing with a big number of looked up items, so you can avoid hitting the maximum document size

Comment by Asya Kamsky [ 21/Nov/19 ]

Every example given in this ticket already should be using regular $lookup with localField and foreignField which reaches into arrays and will use indexes so to keep this ticket open we need an example that actually requires use of expressive $lookup.

Comment by Marco [ 15/Oct/18 ]

I think that I should have opened this ticket as an improvement instead of a bug.

As you say, it would only work then to is an array, but why would that be a restriction for the index? I don't know how mongodb internally uses indexes, but it sounds like it would be a sub-case of the index that the query without the $expr uses, since the field 'to' is also an array.

Also, wouldn't the pipeline raise an exception if 'to' is not an array?

 

Comment by Asya Kamsky [ 12/Oct/18 ]

We can't just use an index for this because in $expr you have, it's only true when "to" is an array.

Comment by Nick Brewer [ 04/Oct/18 ]

Mcat Thanks for your report. We've marked this ticket as related to SERVER-32549 which tracks similar work for non-multikey indexes. I've passed this along to our query team for consideration.

-Nick

Generated at Thu Feb 08 04:46:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.