[SERVER-32549] allow $expr with $in expression use a non-multikey index Created: 04/Jan/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.6.0
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Asya Kamsky Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 25
Labels: storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-36254 No index support for filtering on arr... Closed
is duplicated by SERVER-36994 $expr does not use index for $in Closed
is duplicated by SERVER-38138 $expr doesn't use indexes Closed
Related
related to SERVER-37470 Lookup sub-pipeline is not using inde... Backlog
related to SERVER-45326 Poor Aggregation Framework performanc... Closed
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

Similarly to SERVER-31760 it's possible to rewrite $expr with $in expression to an indexable MatchExpression.

Semantically

 {$expr:{$in:[ "value", [ "v1", "v2", "v3" ]] } } 

is exactly equivalent to

 {"value":{$in:[ "v1", "v2", "v3"]}} 

if value is not an array.

If we restrict using an index for $expr $in expressions to non-multikey indexes/paths, then we will only get back matches from the index that are guaranteed to be correct semantically to the aggregation meaning of $in.

This is also exactly equivalent query that can use any index and return only agg semantics:

db.reviews.find({product_id:{$in:["hat","shirt"],$not:{$elemMatch:{$exists:true}}}})

The second part prevents reaching into arrays for comparison.



 Comments   
Comment by Joerg Baier [ 02/Sep/21 ]

Sharing my concerns from a related issue, as this seems to be more popular.
https://jira.mongodb.org/browse/SERVER-37470?focusedCommentId=4037950&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-4037950

Comment by Asya Kamsky [ 21/Nov/19 ]

Sorry wrong ticket

Comment by Anders Cassidy [ 30/Sep/19 ]

The $unwind -> $match -> $group workaround does not scale well at all...performance is terrible with more than a few thousand rows

Comment by Sten [ 03/Nov/18 ]

Allowing $expr and $in to use non-multikey (and perhaps multikey) indices could be a huge performance improvement for relational queries with aggregation and $lookup stages (and nested lookup stages).
Take for example the following fictional collections: titles > messages > ranks:

Collection: titles

 

{
 "_id": "caab2c8b77b3eeafd7bbe1ed5663727a1636c649b7a5984375714c7406f3eaba",
 "created": 1541175268196,
 "updated": 1541175268196,
 "title": "test title",
 "messages": ["329cb83787a985d25a600c83de93665cf05958be3c37a35c3fa3148bdeb51029", "6f49f7c9da64eb3447a174431f9ea683a7eb1e595456717f8afa329f0be4a14f", "d1ce76f4b7ceedcfec85dc78f7e9bc92976740582b0779da6e4e7eefe528784f"]
}

  

Collection: messages

 

[{
 "_id": "329cb83787a985d25a600c83de93665cf05958be3c37a35c3fa3148bdeb51029",
 "created": 1541175268196,
 "updated": 1541175268196,
 "message": "test",
 "ranks": []
}, {
 "_id": "6f49f7c9da64eb3447a174431f9ea683a7eb1e595456717f8afa329f0be4a14f",
 "created": 1541175268196,
 "updated": 1541175268196,
 "message": "test2",
 "ranks": ["35d7ff35e58437055fcdfdc0bdd6c4dddf547f7ec5799e489873555707306880", "3336b8f1f4ceca3faf2d21c211655953807841d14df44c0e50c33d0df1f22849"]
}, {
 "_id": "d1ce76f4b7ceedcfec85dc78f7e9bc92976740582b0779da6e4e7eefe528784f",
 "created": 1541175268196,
 "updated": 1541175268196,
 "message": "test3",
 "ranks": ["cfc0367a8be9bf3c53566950488e2c30ee9f19a863cda4afc34d50a7b491c23d"]
}]

 

Collection: ranks

[{
 "_id": "cfc0367a8be9bf3c53566950488e2c30ee9f19a863cda4afc34d50a7b491c23d",
 "created": 1541175268199,
 "updated": 1541175268199,
 "ranking": 1,
}, {
 "_id": "3336b8f1f4ceca3faf2d21c211655953807841d14df44c0e50c33d0df1f22849",
 "created": 1541175268199,
 "updated": 1541175268199,
 "ranking": 4,
}, {
 "_id": "35d7ff35e58437055fcdfdc0bdd6c4dddf547f7ec5799e489873555707306880",
 "created": 1541175268199,
 "updated": 1541175268199,
 "ranking": 51,
}]

In order to aggregate messages and ranks within each title record we could use the following aggregation pipeline:
 

[
    {
        '$match': {
            _id: 'caab2c8b77b3eeafd7bbe1ed5663727a1636c649b7a5984375714c7406f3eaba'
        }
    },
    {
        '$lookup': {
            from: 'messages',
            let: {
                messages: '$messages'
            },
            pipeline: [{
                    '$match': {
                        '$expr': {
                            '$in': ['$_id', '$$messages']
                        }
                    }
                },
                {
                    '$lookup': {
                        from: 'ranks',
                        let: {
                            ranks: '$ranks'
                        },
                        pipeline: [{
                            '$match': {
                                '$expr': {
                                    '$in': ['$_id', '$$ranks']
                                }
                            }
                        }],
                        as: 'ranks'
                    }
                }
            ],
            as: 'messages'
        }
    }
]

There is a standard index for the "_id" field in each collection. It would be awesome if indices could be used for these kind of queries, because this would allow for relations of one-to-many in a traditional sense. 

It comes down to allowing indices for these kind of queries indeed (for example ranks collection):

Not using indexes (one-to-many) query:

{
    '$match': {
        '$expr': {
            '$in': ['$_id', ["35d7ff35e58437055fcdfdc0bdd6c4dddf547f7ec5799e489873555707306880", "3336b8f1f4ceca3faf2d21c211655953807841d14df44c0e50c33d0df1f22849"]]
        }
    }
}

 

It is worth noting that $match stages using $eq are infact indexed. This allows for high performance one-to-one relation searches:

Using indexes (one-to-one) query:

 

{
    '$match': {
        '$expr': {
            '$eq': ['$_id', "35d7ff35e58437055fcdfdc0bdd6c4dddf547f7ec5799e489873555707306880"]
        }
    }
}

 

Comment by Asya Kamsky [ 19/Oct/18 ]

There is a workaround for cases where this is an expressive "$lookup" on an array of scalars and that's to first $unwind and then use $eq rather than $in. It doesn't help with using $in+$expr outside of $lookup.

Comment by David Storch [ 09/Jan/18 ]

Could you provide guidance on priority? It's still not clear to me whether you believe that we need to schedule this work alongside 3.8 development.

Comment by Asya Kamsky [ 08/Jan/18 ]

I'm ok with it being in follow-on improvement though it's a pretty painful shortcoming of original implementation.

Comment by David Storch [ 05/Jan/18 ]

asya, I'm to suggest removing this from the "More expressive $lookup" project and tracking it as a follow-on improvement. Does that work for you?

Also, could you provide guidance on priority? Is this something we should schedule (and maybe even aim to backport to 3.6)?

Generated at Thu Feb 08 04:30:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.