[SERVER-10472] find(a).sort(b,_id) doesn't use index(a,b) to minimize scans Created: 09/Aug/13  Updated: 18/Dec/23

Status: Backlog
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Tim Haines Assignee: Ben Shteinfeld
Resolution: Unresolved Votes: 3
Labels: query-44-grooming, query-offsite, query_triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OSX, MongoLabs


Attachments: File server10472.js    
Issue Links:
Depends
depends on SERVER-3310 Query optimizer should efficiently ha... Closed
Related
is related to SERVER-12533 Include sort (spec) for sort stage in... Closed
Assigned Teams:
Query Optimization
Participants:

 Description   

I have a collection of documents with keys _id, a, and b. I have an index {a:1,b:1}.

When I do find(a).sort(b).limit(5) the index is used, and only 5 documents are scanned.

When I do find(a).sort(b,_id).limit(5) (with or without a hint) it appears the index is only used to match a - as all documents with the matching value for a are scanned (nscanned and nscannedObjects are high).

The behavior I would expect is that it would scan through a and b values enough to get to the limit (5), and then it would stop scanning when it found the next unique b (or a). It would then sort on _id within the scanned records.
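The expected behavior can be sketched in plain Python (an illustrative model of the index scan, not MongoDB internals; all names here are hypothetical): rows arrive from the {a:1, b:1} index in b order for the fixed a, the scan collects `limit` rows plus however many it takes to reach the next distinct b, and only the scanned rows are sorted by (b, _id).

```python
# Hypothetical sketch of the requested optimization (not MongoDB code).
# `rows` simulates an index scan over {a:1, b:1} for one value of a,
# so rows arrive already ordered by b.

def limited_scan_sort(rows, limit):
    """Scan `limit` rows plus enough to reach the next distinct b,
    then sort the scanned rows by (b, _id) and return the top `limit`."""
    buf, examined = [], 0
    for row in rows:
        examined += 1
        if len(buf) >= limit and row["b"] != buf[-1]["b"]:
            break  # a new b value: no later row can enter the top `limit`
        buf.append(row)
    buf.sort(key=lambda r: (r["b"], r["_id"]))
    return buf[:limit], examined

# Six docs with a=1 and distinct b values, mirroring the repro below.
rows = [{"_id": i, "a": 1, "b": i} for i in range(1, 7)]
top, examined = limited_scan_sort(rows, 3)
print([r["_id"] for r in top], examined)  # [1, 2, 3] 4
```

The break is safe because every buffered row has b strictly less than the new b, so once `limit` rows are buffered, no later row can displace them.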



 Comments   
Comment by Michael Longo [ 25/Feb/22 ]

If I may add, I'm looking to upgrade my current MongoDB 4.2 (yes I know, I'm late) and the 4.4 release notes state:

$sort Changes
Starting in MongoDB 4.4, the sort() method now uses the same sort algorithm as the $sort aggregation stage. With this change, queries which perform a sort() on fields that contain duplicate values are much more likely to result in inconsistent sort orders for those values.
To guarantee sort consistency when using sort() on duplicate values, include an additional field in your sort that contains exclusively unique values.
This can be accomplished easily by adding the _id field to your sort.

Which is kind of a pain due to this ticket. I'll basically have to update my indexes for a few corner-case duplicates that could be solved with a quick in-memory sort.

Comment by Michael Longo [ 08/Feb/22 ]

The use case can be simplified a lot. We don't need a compound index or any filter. Just:

for (var a = 0; a < 500; a++) { db.test.insert({a:a}) }
db.test.ensureIndex({a:1})
db.test.find().sort({a:1,_id:1}).limit(10).explain('executionStats')

is enough to reproduce the "issue". Even though only 10 documents were requested, it ends up examining and sorting all 500 documents, even though in this particular situation no sort would actually be required at all.

And even if we had duplicate entries, it should really only sort those duplicates instead of the whole collection. This is really a bummer for something that appears so simple and intuitive.
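The "only sort the duplicates" idea can be sketched in plain Python (an illustrative model, not MongoDB internals; the helper name is hypothetical): since the index already yields rows in `a` order, an {a:1, _id:1} sort only needs to reorder runs of rows that share the same `a` value.

```python
from itertools import groupby

# Illustrative sketch (not MongoDB code): rows arrive in index order on `a`,
# so only runs of duplicate `a` values need an in-memory sort on `_id`.
def sort_with_tiebreak(rows_in_a_order, limit):
    out = []
    for _, group in groupby(rows_in_a_order, key=lambda r: r["a"]):
        out.extend(sorted(group, key=lambda r: r["_id"]))
        if len(out) >= limit:
            break  # later groups have strictly larger `a` and cannot matter
    return out[:limit]

# 500 distinct values, as in the repro above: every group has size one, so
# effectively nothing is sorted and only about `limit` rows are touched.
rows = [{"_id": a, "a": a} for a in range(500)]
print([r["a"] for r in sort_with_tiebreak(rows, 10)])
```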

Just adding my 2 cents, hoping this gets noticed and maybe prioritized?

Comment by David Storch [ 16/Aug/19 ]

I'm changing the issuetype from "Bug" to "Improvement". Capturing the steps to reproduce, since they have a very helpful example of a query that would benefit from this suggested optimization:

db.temp.insert({_id:1, a:1, b:1})
db.temp.insert({_id:2, a:1, b:2})
db.temp.insert({_id:3, a:1, b:3})
db.temp.insert({_id:4, a:1, b:4})
db.temp.insert({_id:5, a:1, b:5})
db.temp.insert({_id:6, a:1, b:6})
 
db.temp.ensureIndex({a:1, b:1})
 
// Scans 3 objects
db.temp.find({a:1}).limit(3).sort({b:1}).explain()
 
// Include _id as secondary sort field
// Scans 6 objects.  Should only need to scan 4  (3 + enough to reach next unique b value)
db.temp.find({a:1}).limit(3).sort({b:1, _id:1}).explain()

Comment by Asya Kamsky [ 16/May/18 ]

vkuznet

This is not the same issue. The issue described in this ticket involves a compound index that is used for both the find and the sort, but not in the most efficient manner when a second field is added to the sort specification.

Your issue is the known limitation that two different indexes cannot be combined, one for the find and one for the sort. If you observe that it works with one driver but not another, something else is likely the actual issue, and it would be easiest to debug if you open a new SERVER ticket for it (if it does not work via the shell and it should). If it works in one driver but not another, you should open a Jira ticket in the driver project (for the one it does not work in).

 

Comment by Valentin Kuznetsov [ 23/Apr/18 ]

For the record, I used MongoDB shell version v3.6.3 and MongoDB server version: 3.6.3

Comment by Valentin Kuznetsov [ 23/Apr/18 ]

Hi,
I hit this issue with the mgo driver while looking at a large collection of docs, filtering by one set of conditions (which were indexed) and sorting by another key (also indexed). The issue seems to be in the MongoDB server: with pymongo I don't see it, while with mgo it exists. Here is a way to reproduce it (I used Python code for the data injection):

from pymongo import MongoClient
 
uri = 'mongodb://localhost'
client = MongoClient(uri)
col = client['test']['db']
col.remove({})
 
col.create_index("hash")
col.create_index("dataset.name")
 
msg = "lksjdflksjdflsj"
nrec = 10000 # size of the bulk
for idx in range(100):
    res = col.insert_many(\
            [{"hash": msg, "dataset":[{"name": msg*10+str(idx*nrec+i), "foo":msg}] } \
            for i in range(nrec)])
    print(idx, res)
 
spec = {"hash": msg}
res = col.find(spec).sort("dataset.name", 1).count()
print("count found %s doc" % res)
res = col.find(spec).sort("dataset.name", 1)
print("first doc", res[0])

This will insert 1,000,000 docs, and I am able to fetch the doc using my look-up query. But if I try to do the same in the mongo shell it fails:

> db.db.find({"hash":"lksjdflksjdflsj"}).sort({"dataset.name":1})
Error: error: {
        "ok" : 0,
        "errmsg" : "errmsg: \"Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.\"",
        "code" : 96,
        "codeName" : "OperationFailed"
}

Obviously the query used the look-up index as well as the index on the sorted field. I don't think it is an optimization issue, since I provided both indexed fields; rather, it looks like an issue with the search implementation.

Comment by Matt Kangas [ 04/Feb/15 ]

This ticket is a request for a query optimization. Here is a brief outline of what will be required to implement it.

During query planning we determine if the requested sort is satisfied by an index. Today this is a binary decision. If yes, we can skip an explicit sort stage. If no, we choose a sort strategy and add it to the query plan.

The requested optimization requires changing this answer from yes/no to yes/no/prefix, where "prefix" means the index partially satisfies the sort order.

Later, during query execution, the sort stage would need to apply logic to stop consuming input documents from the index scan once the limit is reached, then proceed with a normal sort.

This is a nontrivial amount of machinery to maintain and test. It is presently unclear that the result will be a significant win for many users, particularly when compared to other possible query optimizations.
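The yes/no/prefix planning decision outlined above can be sketched as a small helper (a simplified model with hypothetical names, not the real planner; sort direction and multikey handling are ignored). Index fields constrained by equality predicates contribute no ordering, so they are skipped before comparing against the sort spec:

```python
def sort_satisfaction(index_fields, sort_fields, equality_fields=()):
    """Classify how an index key pattern satisfies a sort pattern:
    'full', 'prefix' (only leading sort fields are provided), or 'none'.
    Hypothetical sketch; direction (1/-1) and multikey cases are ignored."""
    # Equality-bound fields are constant in the result set, so drop them.
    usable = [f for f in index_fields if f not in equality_fields]
    matched = 0
    for s, i in zip(sort_fields, usable):
        if s != i:
            break
        matched += 1
    if matched == len(sort_fields):
        return "full"
    return "prefix" if matched > 0 else "none"

# find({a:1}).sort({b:1}) with index {a:1, b:1}: fully satisfied.
print(sort_satisfaction(["a", "b"], ["b"], equality_fields={"a"}))         # full
# find({a:1}).sort({b:1, _id:1}): only the leading `b` is provided.
print(sort_satisfaction(["a", "b"], ["b", "_id"], equality_fields={"a"}))  # prefix
```

In the "prefix" case, the execution-time rule from the description applies: keep consuming index rows past the limit only until the matched prefix fields change value, then sort the buffered rows.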

Comment by David Hows [ 22/Aug/13 ]

Hi Tim,

Thanks for bearing with me, I see what you're saying.

There is potential for optimisation and we will give this further consideration.

Thanks again,
David

Comment by Tim Haines [ 22/Aug/13 ]

Hi David,

If the query for a specifies one value, and the sort for b is the dominant sort option, then why can't the scanner just look at <limit> documents and then scan up to the next change in value of b?

Can you provide a small dataset to illustrate why this doesn't work please?

Thank you,

Tim.

Comment by David Hows [ 19/Aug/13 ]

Thanks Tim,

Is the index still on

{a:1,b:1}

?

If so, then the same reasoning above will still apply. Given you are sorting on

{b:1, _id:1}

then Mongo will need to bring in all matching documents to sort them correctly to return the correct 3 documents (given sort order) as there is no other way to perform the sort on

{b:1,_id:1}

using the

{a:1,b:1}

index.

You would need to add an index which includes _id to sort correctly without scanning all matching documents, even with the limit. For instance

{a:1, b:1, _id:1}

would allow you to match and sort in this manner without scanning.

Regards,
David

Comment by Tim Haines [ 13/Aug/13 ]

Hi David,

I'm sorry, I screwed up the steps to reproduce (and it looks like I can't edit them now?).

The steps to reproduce should have shown a sort by

{b:1}

in the first example and

{b:1, _id:1}

in the example showing the problem. In the problem case, Mongo should realize that after it has found the requested limit of documents, it only needs to keep scanning until it finds a change in the value of b. In the case of the problem example, but with a sort of

{b:1, _id:1}

, it should only need to scan 4 documents.

Comment by David Hows [ 12/Aug/13 ]

Hi Tim,

The issue in this is that MongoDB cannot assume that the _id values are sorted given the index of a,b.

This means that when you ask MongoDB to find three documents sorted by a and _id, it must scan all 6 documents to confirm the sort order. Once it has confirmed which are the three relevant documents given the sort order, it will return a cursor to those three documents.

This could be resolved by adding an index with the following definition:

db.temp.ensureIndex({a:1, _id:1})

With the explain result of:

db.temp.find({a:1}).limit(3).sort({a:1,_id:1}).explain()
{
	"cursor" : "BtreeCursor a_1__id_1",
	"isMultiKey" : false,
	"n" : 3,
	"nscannedObjects" : 3,
	"nscanned" : 3,
	"nscannedObjectsAllPlans" : 3,
	"nscannedAllPlans" : 3,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		"a" : [
			[
				1,
				1
			]
		],
		"_id" : [
			[
				{
					"$minElement" : 1
				},
				{
					"$maxElement" : 1
				}
			]
		]
	},
	"server" : "Pixl.local:27017"
}

Regards,
David

Generated at Thu Feb 08 03:23:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.