[SERVER-3310] Query optimizer should efficiently handle $in and sort with compound index. Created: 21/Jun/11  Updated: 12/Jul/16  Resolved: 13/Jan/14

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 2.5.5

Type: Improvement Priority: Major - P3
Reporter: Kyle Banker Assignee: Hari Khalsa (Inactive)
Resolution: Fixed Votes: 51
Labels: query_triage

Attachments: File server3310.js    
Issue Links:
Depends
depends on SERVER-5063 $in on first compound key element and... Closed
is depended on by SERVER-10472 find(a).sort(b,_id) doesn't use index... Open
Duplicate
duplicates SERVER-959 never do a table scan to answer db.f.... Closed
is duplicated by SERVER-2148 $in with sort doesnt use index Closed
is duplicated by SERVER-5032 unnecessarily large nScanned occurs d... Closed
Related
related to SERVER-7742 Improve the way indexes are used to d... Closed
related to SERVER-3758 push limit down into index scan for s... Open
related to SERVER-8790 Introduce composable "stages" in quer... Closed
is related to SERVER-11116 cursor limit has inconsistent effect ... Closed
Participants:

 Description   

Suppose you have an index on

{friend_id: 1, date: -1}

Then, suppose you have a query like this:

db.things.find({ friend_id: { $in: [lots_of_ids] } }).sort({ date: -1 }).limit(20)

Currently, this does a scanAndOrder and scans quite a few documents. Can we optimize?
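For anyone reproducing this, the behavior is visible in explain() output; in the 2.x explain format the relevant fields are scanAndOrder and nscanned:

db.things.find({ friend_id: { $in: lots_of_ids } }).sort({ date: -1 }).limit(20).explain()
// Expect "scanAndOrder" : true and an "nscanned" far above the 20
// documents actually returned.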



 Comments   
Comment by Eliot Horowitz [ 22/Jun/11 ]

For this case:

Index: ( a , b )
Query: find( { a : { $in : [ ... ] } } ).sort( b )

We should do a merge sort on each $in element.
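A minimal JavaScript sketch of that strategy, assuming one b-sorted stream per $in element (illustrative only; the names and the stream interface with next() are hypothetical, and the real implementation lives inside the server):

// Merge k already-sorted streams until 'limit' results are produced.
// Each stream stands in for the index scan of one $in value, e.g. the
// documents matching { a : v } in b-order, and exposes next() returning
// a document or null when exhausted.
function mergeSortedStreams(streams, limit) {
    // Seed with the head of each non-empty stream.
    var heads = streams
        .map(function (s) { return { stream: s, doc: s.next() }; })
        .filter(function (h) { return h.doc !== null; });

    var out = [];
    while (out.length < limit && heads.length > 0) {
        // Pick the head with the smallest b. A priority queue would make
        // this O(log k) per result; a linear scan keeps the sketch short.
        var best = 0;
        for (var i = 1; i < heads.length; i++) {
            if (heads[i].doc.b < heads[best].doc.b) best = i;
        }
        out.push(heads[best].doc);
        heads[best].doc = heads[best].stream.next(); // advance that stream
        if (heads[best].doc === null) heads.splice(best, 1);
    }
    return out;
}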

Comment by Saurabh Agrawal [ 22/Jul/11 ]

What's the target release or expected time for this fix?

Comment by Ayush Ghai [ 25/Jul/11 ]

There's no need to do a full merge sort if the limit is significantly less than the total number of candidates. The task at hand is to build a priority queue over the heads of the multiple $in queues and then select the top limit elements from that queue.

This will require very little coding on your end, I'm sure. For a social network news feed, this query becomes very important. Imagine generating a news feed of 20 items for a person with 300 friends! There are two options in such a case:

1. Build 300 different queues on the client by fetching them from the DB, then pull the top N (20) by timestamp on the client side (basically doing the server's job until this bug is fixed). The downside is that there will be 300 hits to the DB! I don't like this much.
2. Index a and b separately and query by a range on b: find( { a : { $in : [ ... ] }, b : { $gt : somevalue } } ). Keep shifting the range on b and repeat when you have to 'show more'. This returns documents grouped by a and, within each group, sorted by b. The benefit over the first option is only one hit to the DB; the tradeoff is that the whole result set has to be loaded into memory before picking the top N, which still means more data transfer.

Both solutions are workarounds. I wonder how other people do this for social network feeds.

It will be great if this 'small' fix is added soon.
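A rough mongo shell sketch of workaround 2 (the collection, field names a and b, and the cutoff value are all hypothetical):

// Single round trip: range on b, client re-sorts and picks the top N.
var ids = [ /* friend ids */ ];
var cutoff = ISODate("2011-07-01T00:00:00Z"); // hypothetical lower bound on b
var docs = db.things.find({ a: { $in: ids }, b: { $gt: cutoff } }).toArray();
// Results come back grouped by a (as described above), so re-sort by b on the client:
docs.sort(function (x, y) { return y.b - x.b; });
var top20 = docs.slice(0, 20);
// To 'show more', shift the cutoff and repeat.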

Comment by Saurabh Agrawal [ 25/Jul/11 ]

By the way, how bad is the first option? Take, for example, the above-mentioned case of a social network news feed, where a person has on average, let's say, 100 friends, and the merging operation is carried out at runtime. Is it a viable option at all in such a case?

Comment by James Smith [ 29/Nov/11 ]

Is this definitely going to make it into 2.1? Would be a great help for reducing overall index size on a collection.

Comment by Eliot Horowitz [ 29/Nov/11 ]

@james - not definite, but hopeful

Comment by Milan Gornik [ 07/Dec/11 ]

Upgrading from Mongo 1.8 to Mongo 2.0 caused a performance issue on our project. Since more data is now loaded from disk, even when it is present in a multikey index, we saw high nscannedObjects, which killed the performance of several queries. Before, everything ran fully in memory for us (the whole index fits in memory); now it runs partly from memory and partly from disk. Here's the group discussion: http://groups.google.com/group/mongodb-csharp/browse_thread/thread/dec2f5ac803c25d1

Implementing this task would probably fix the issue for us; in the meantime we might consider reverting to Mongo 1.8. So, I give my vote for getting a fix for this issue into the next MongoDB release. Thanks!

Comment by Eliot Horowitz [ 24/Feb/12 ]

SERVER-5063 is not a full solution for this - but will make this considerably better in most cases and is a lot simpler than a complete answer.

Comment by James Devine [ 17/May/12 ]

I am having a similar issue with this query:

db.messages.find({"$and" :[{"search_terms" : {"$all" : [ /^From:facebookmail com / ]}},

{"chunk" : 0}

]}).sort(

{timestamp: -1}

).limit(400)

I have an index that includes both search_terms, which is an array of strings, and timestamp. I also created a separate index on just timestamp, on advice.

An explain on the query above shows that it uses both indexes, probably the first for the search and the second for the sort, and that it scans 10122 of the 25000+ matching objects to find the first 400, taking 30 seconds to do so.

Comment by Aaron Staple (Inactive) [ 17/May/12 ]

James - Your regex query will match a range of values on the search_terms field. I think the most relevant open ticket is SERVER-3758.

Comment by David Marquez [ 21/Sep/12 ]

I tried a few different arrangements of this multi-value query against a compound index with a sort on the second key. I tried $in and $or, and they show the same issue. Looking at the explain() output, I can tell it is not using the index, not even with hint(). We are using the 64-bit version 2.2.0 on Linux.

Comment by Aaron Staple (Inactive) [ 21/Sep/12 ]

Hi David - if you provide more details on your experiments and post them to our Google group <http://groups.google.com/group/mongodb-user>, someone can help you there.

While this ticket isn't implemented yet, a subset of the relevant cases will be optimized by the SERVER-5063 work in 2.2.

Comment by Daniel Kim [ 05/Nov/12 ]

Hi guys,

Can we make sure that this also works for queries that have multiple $in conditions? For instance, the core query for my application looks like this:

index:

(a, b, c, d)

query:

find({
  a : { $in : [ ... ] },
  b : { $in : [ ... ] },
  c : { $in : [ ... ] }
}).sort( d ).limit( n )

This query should also be able to make use of the compound index to avoid full scans while sorting.

Also, this should work with more complex queries as long as the other components of the query do not interfere with the use of the index for sorting. For example, I also have a critical query that adds an $all condition to the query above, i.e.:

index:

(a, b, c, d, e)

query:

find({
  a : { $in : [ ... ] },
  b : { $in : [ ... ] },
  c : { $in : [ ... ] },
  d : { $all : [ ... ] }
}).sort( e ).limit( n )

Comment by Michael Häusler [ 28/May/13 ]

This optimization is really critical in many use cases. E.g., find the 10 latest movies of your favorite actors.

{
  "title" : "MongoDB - The Movie",
  "actorIds" : [7, 25, 42],
  "date" : ISODate("2013-05-23T00:00:00Z")
}

db.movies.ensureIndex({ "actorIds" : 1, "date" : -1 });
 
db.movies.find({ "actorIds" : { "$in" : [42, 451] } }).sort({ "date" : -1 }).limit(10);

For this use case, it is important that the optimization will also work with a compound multikey index.

Comment by Hari Khalsa (Inactive) [ 13/Jan/14 ]

See SERVER-1205

Comment by Edward [ 27/Apr/14 ]

It seems that as of the current version (2.6.1), the query optimizer is still unable to optimize queries that use $or, $in, limit() and sort() all at once. The https://jira.mongodb.org/browse/SERVER-1205 and https://jira.mongodb.org/browse/SERVER-3310 fixes each only improved performance on queries combining 3 of the 4 operations listed above; introducing a 4th operation throws the optimization out the window. The symptom is full index and document scans within the $or clause, even though a limit() is specified.
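For concreteness, a hypothetical query of the shape described (field names illustrative), combining all four operations at once:

// $or + $in + sort() + limit() together; per the observation above, as of
// 2.6.1 this still scans the full index/documents inside the $or branches.
db.coll.find({
    $or: [
        { a: { $in: [1, 2, 3] } },
        { b: { $in: [4, 5, 6] } }
    ]
}).sort({ c: -1 }).limit(10)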

Comment by James Hartig [ 07/May/14 ]

Any chance we can get this backported to 2.2.x?

Comment by Eric Milkie [ 07/May/14 ]

Unfortunately, the query optimizer code is completely different in 2.2.x, so backporting will not be possible.
