[SERVER-6099] Advanced Queries $gt and $lt take longer in 'indexed' collection Created: 15/Jun/12  Updated: 11/Jul/16  Resolved: 26/Jun/12

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.0.5
Fix Version/s: None

Type: Question Priority: Critical - P2
Reporter: btd5nerds Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: query
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

linux


Attachments: File indexOr.bmp     File indexOr2.bmp     File no index.bmp    
Participants:

 Description   

My collection has 1 million data.

I am performing this query to retrieve
x > 70000 or x < 10000

In case 1, the pid (player id) of the collection isn't indexed.
a) if i query with {$or:[{pid:{$gt:700000}},{pid:{$lt:100000}}]},
it took around 750ms
b) if i query with {$or:[{pid:{$lt:100000}},{pid:{$gt:700000}}]}, (simply change the order of expression)
it took around 800ms

Everything is still ok.

BUT in case 2, i indexed pid, which supposed to provide a faster query. However,
a) if i query with {$or:[{pid:{$gt:700000}},{pid:{$lt:100000}}]},
it took around 980ms
b) WORSE, if i query with {$or:[{pid:{$lt:100000}},{pid:{$gt:700000}}]},
it took around 1600ms!!

I have 2 questions here:
1. Why the indexed collection takes longer?
2. Why different expressions could have such a big difference in the time of query? ( does it have something to do with $or?)



 Comments   
Comment by Kristina Chodorow (Inactive) [ 26/Jun/12 ]

Great!

Comment by btd5nerds [ 26/Jun/12 ]

oO. This makes everything clearer!

Kristina, thanks a lot! This can be considered resolved.

Comment by Kristina Chodorow (Inactive) [ 26/Jun/12 ]

You're welcome!

Sorry it's taken me a while to get back to you, I've been trying to figure this out.

The likelihood of a query yielding is proportional to how long it runs and the two $or clauses are treated as separate queries. Later $or clauses can be slower than earlier ones because they do the de-duping. So, if >700k is slower and second, it is more likely to hit the yield threshold.

Also, yielding will only occur when there's a write waiting to happen. I'm not sure what your system is doing, but that might be a factor.

Comment by btd5nerds [ 22/Jun/12 ]

oO. That's a great info.
Right, that 'yield' is probably the culprit in my experiments.
From observation, case 2) would tend to have more yield than case 1) (any idea?)

Anyway, thanks for the explanation.

Comment by Kristina Chodorow (Inactive) [ 22/Jun/12 ]

Sure! MongoDB has fairly coarse locking (in 2.0, one read/write lock per process, in 2.2 it'll be one R/W lock per database). To prevent writes from getting stuck behind long-running reads, queries occasionally yield the read lock so some writes can go, then they'll grab it again. The nyields field in explain() output shows you how many times a query yielded the read lock.

Comment by btd5nerds [ 22/Jun/12 ]

oO. I think you have just mentioned the term that's unfamiliar to me: yield. Could you elaborate a little bit what is yield? I have searched around but didn't get a clear picture about it. Thanks in advance.

Comment by Kristina Chodorow (Inactive) [ 22/Jun/12 ]

> "2%-40%"
> This is quite a huge range =.=

Yeah, it varies on data/access patterns/query/indexes, sorry :-/ I'd generally time both for anything returning more than 20%.

> I thought they are supposed to consume the similar period of time, but they don't (and the difference is 2 > times). Is this expected?

At least in the example you gave, (2) yields once, which would explain the difference in time. Does (2) consistently yield (nyields>0) and (1) does not? If it does not always yield, is the difference in query time present when nyields=0?

Comment by btd5nerds [ 22/Jun/12 ]

First, thanks for the reply.

"2%-40%"
This is quite a huge range =.=
By the way, (apart from the index issue), the thing is, if i use different approach of querying the same range of data:
1) >700k OR <100k
2) <100k OR >700k
I thought they are supposed to consume the similar period of time, but they don't (and the difference is 2 times). Is this expected?

Comment by Kristina Chodorow (Inactive) [ 21/Jun/12 ]

It looks like the issue is that using an index is less efficient that a table scan when you're returning a large hunk of your data set, which is a known limitation of databases in general (relational and non-relational). The overhead of going from index entry to doc for each element match is not worth the overhead once you're returning a certain percent of your data (estimates vary from 2%-40% of your data... kind of depends on what you're doing). You can use .hint({$natural:1}) to force a table scan.

Comment by btd5nerds [ 18/Jun/12 ]

fyi, the 'gcid' in the image is actually the 'pid' (i just rename it)

Comment by btd5nerds [ 18/Jun/12 ]

Hi Kristina,

I have uploaded 3 images:
a) no index.bmp

  • this is the performance of querying if the collection is without any index.
  • the left side shows db.coll.find({$or:[{pid:{$gt:700000}},{pid:{$lt:100000}}]}).explain()
  • the right side shows db.coll.find({$or:[{pid:{$lt:100000}},{pid:{$gt:700000}}]}).explain()

b) indexOr.bmp

  • this is the querying of the indexed collection
  • db.coll.find({$or:[{pid:{$gt:700000}},{pid:{$lt:100000}}]}).explain()

c) indexOr2.bmp

  • this is the querying of the indexed collection
  • db.coll.find({$or:[{pid:{$lt:100000}},{pid:{$gt:700000}}]}).explain()

The difference of b) and c) can grow very large if collection is expanded.
I have tried inserting 20M data. The case c) takes 2 times longer than case b).

Comment by Kristina Chodorow (Inactive) [ 15/Jun/12 ]

Can you do an explain on each of the options? E.g.,

> db.coll.find({$or:[{pid:{$gt:700000}},{pid:{$lt:100000}}]}).explain()

and so on?

Generated at Thu Feb 08 03:10:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.