[SERVER-6720] Range query with compound multikey index scans more objects than needed Created: 06/Aug/12 Updated: 11/Jun/19 Resolved: 11/Oct/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Querying |
| Affects Version/s: | 2.0.5, 2.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sergejs Degtjars | Assignee: | hari.khalsa@10gen.com |
| Resolution: | Won't Fix | Votes: | 3 |
| Labels: | indexing, query | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows, Linux |
||
| Attachments: |
|
||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Description |
|
I noticed very strange behavior IMO when range queries used with compound index where one of indexed fields is array.
When I use simple key index or compound with simple type fields query optimizer runs as expected:
But when I use compound index with array field, $lte condition is simply ignored, and all index entries from lower bound is scanned
In this case count of scanned objects is bigger that count of actual documents, and query took lot of time. |
| Comments |
| Comment by David Storch [ 29/Apr/14 ] | |||||||||||||||||||
|
xres74, good point! In your example the field happens not to be an array field, and hence it would be safe to intersect the bounds. However, the query planner has to construct the index bounds without looking at the data. The planner looks in the index catalog in order to determine the available indices and whether these indices are multikey, but the catalog does not indicate which fields are array fields in a compound multikey index. Suppose we have a multikey index {a: 1, b: 1, c: 1}. The query planner has no idea whether "a", "b", or "c" is the array field. In order to be fully correct, it therefore will never intersect index bounds for any of these three fields. This begs the question "Why doesn't the index catalog store which field of an index is multikey?" The answer is that this is allowed to vary between documents. For the multikey index {a: 1, b: 1, c: 1} we might have documents
This means that for some documents it might be safe to intersect index bounds, but for others the field is an array and we cannot intersect the bounds. In order to be completely sure that it is safe to take the intersection, the query planner would have to inspect the entire collection. This is arguably a design flaw---it would be really nice, for instance, to enforce constraints such that only a particular field of a multikey compound index is allowed to be an array. But for now we live in a world where it is always considered incorrect to intersect bounds for a multikey index (without $elemMatch). I hope that helps to clear up the behavior as it stands. asya, any other ideas for a workaround that may help Edward? | |||||||||||||||||||
| Comment by Edward [ 29/Apr/14 ] | |||||||||||||||||||
|
Thank you, David and Asya for explaining. The example you gave is very clear, and it illustrated why a ranged query made on an array field would sometimes yield inaccurate results. And because of this, there was a decision to not use boundaries when a multikey index is involved. However, in a compound multikey index, as is in my case, the range condition is not done on the array part of the index, thus should not have the same problem as illustrated in your example. Is the query engine able to identify which field of the compound multikey index is the array field? If it is able, then it should be able to determined that it can apply boundary conditions when the boundary is specified for a non-array field, and ignore it when it is. I have tried to explore possible workarounds, and for this issue, I could not come up with one, except altering the original intended feature in my application. In other words, this is a blocking issue for me. | |||||||||||||||||||
| Comment by David Storch [ 28/Apr/14 ] | |||||||||||||||||||
|
Hi xres74, Here is an explanation for why you are seeing odd index bounds which only appear to incorporate one of your query predicates over the array field. I'll start with an example. Let’s say we have the document {a: [5, 7]}. This document satisfies the query {$and:[{a: {$gt: 6}}, {a: {$lt: 6}}]}. For the index {a: 1}, the index keys will be 5 and 7. If we were to intersect the index bounds, we would end up with an empty interval (the intersection of [-Infinity, 6) and (6, Infinity] is empty) and hence would return no results. Instead, this query is executed by scanning the index over one of the ranges, say [-Infinity, 6) and then filtering the resulting documents by {a: {$gt: 6}}. The general principle here is that for a multikey index, index ranges cannot be intersected. Intersecting the ranges is only valid if a single index key must satisfy both simultaneously. But for multikey indices, queries like {$and:[{a: {$gt: 6}}, {a: {$lt: 6}}]} make sense, where the expectation is that different query predicates match different index keys. The $elemMatch operator is used to say that the same index key must simultaneously satisfy the predicates. This is why you see that the index bounds are intersected when you add an $elemMatch. A possible workaround would be to build an index on the 'ca' field that is not compounded with an array field. This way the index will not be multikey, and bounds should be intersected as you expect. | |||||||||||||||||||
| Comment by Asya Kamsky [ 28/Apr/14 ] | |||||||||||||||||||
|
Unfortunately I think the work-around was last relevant back in 2.2 (or 2.0?) - we did note that $elemMatch didn't match things correctly - sorry I think maybe the comment in this ticket should have been amended at that point. | |||||||||||||||||||
| Comment by Edward [ 28/Apr/14 ] | |||||||||||||||||||
|
It makes sense that $elemMatch works only on array fields, as that is what it was designed to operate on. So when I first read about the work around by Asya Kamsky suggesting the use of $elemMatch, I was a bit confused. In my case, the range condition is not coming from the multikey field. If I query with the following:
where owner is the multikey field, I would get the upper bound of the range to be something like 1.7976931348623157e+308, which basically causes the query to scan all documents greater than lower bound, which is not how it should be. And this issue is what brought me to this ticket, because my problem seemed similar to the issues reported in this ticket. When I added the $elemMatch on the 'ca' field, which is not a multikey field, the upper and lower bounds reported in explain() for 'ca' is now correct. So even though it made no sense to me, it at least seemed to have worked on fixing the query range boundaries. But with that, I get n:0, nscannedObjects:2, nscanned:2 from my explain(). This looks to me that it is using the index correctly, found the documents that are suppose to be matches and scanned them, but for some reason, didn't think they are matches. If I can't use $elemMatch as a work around, then what can I do to specify a range on a non-multikey field, in a query that involves a multikey field index? Currently, the upper bound will not be used even though it is specified. So is this the same issue as described in this ticket, or is it a different issue that needs to be looked at? | |||||||||||||||||||
| Comment by David Storch [ 28/Apr/14 ] | |||||||||||||||||||
|
Hi xres74, I've looked into your query and concluded that this is working as designed. The reason is that the $elemMatch query operator will only match array fields. Here is a simple example that shows this behavior:
Although the document with "_id: 0" has a value for "a" which satisfies {$gt: 0, $lt: 2}, this document does not match because "a" is a non-array field. On the other hand, the document for which "a" is an array field does match. The matching semantics work the same way regardless of what indices are available. If we build an index on "a" and repeat the query, we get the same result set:
I hope this explanation is helpful. Please let us know if you have any further questions. | |||||||||||||||||||
| Comment by Edward [ 27/Apr/14 ] | |||||||||||||||||||
|
The query is:
The index used is: designate_1_ca_-1_owner_1 In the data, the 2 records that match are:
P.S. this is done with PHP using the latest PHP driver as of today (2014/4/27). | |||||||||||||||||||
| Comment by Asya Kamsky [ 27/Apr/14 ] | |||||||||||||||||||
|
What is your data and your query? | |||||||||||||||||||
| Comment by Edward [ 27/Apr/14 ] | |||||||||||||||||||
|
As of 2.6.1, $elemMatch does make the boundaries correct, but it is not matching the records correctly. "cursor": "BtreeCursor perm_list_1_ca_-1_owner__id_1", | |||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 11/Oct/13 ] | |||||||||||||||||||
|
As Asya said, use $elemMatch to do what you are looking for. The reason we can't intersect the two bounds for HierarchyLevel is that we don't know which key is an array in the multikey index, therefore one entry in the array could satisfy one predicate, and the other could satisfy the other. If we intersect the bounds we will not correctly match on the document. | |||||||||||||||||||
| Comment by Asya Kamsky [ 18/Sep/13 ] | |||||||||||||||||||
|
This would be easier with a specific query example but basically you can work around the "too wide a range" used by using $elemMatch when you have a range (even when the range is NOT on the array field). So an example based on original query in this ticket would be to rewrite it as:
This will cause the correct boundaries to be used. | |||||||||||||||||||
| Comment by Jon Hyman [ 17/Sep/13 ] | |||||||||||||||||||
|
We had been working with ObjectRocket who had a ticket open (I believe CS-8749) for this issue which should have more details, which was then confirmed to be due to this bug. We actually spent all of yesterday rewriting our code to move our index off of the array field we had since we hit this bug: what had happened was that until last week, the first part of the indexed field was always an array of size 1. Once one of the documents was updated to be an array of size 2, we saw this bug with the index upper range being something like 1.7976931348623157e+308. We literally had a 100x speed up after the code rewrite (queries went from taking over 37 seconds to about 300ms because the former was scanning 10s of millions of extra records). I can't find one of the explains off-hand, ObjectRocket and I were discussing this and sending around explains() in Hipchat which I don't have a copy of. If you can't find info on the internal ticket mentioned above, let me know and I'll contact OR to get the explains() from their Hipchat history. | |||||||||||||||||||
| Comment by Asya Kamsky [ 17/Sep/13 ] | |||||||||||||||||||
|
jon@appboy.com can you include your query and explain? depending on exact structure there may be a work-around. If it turns out to need a discussion, that may be better handled via google group. | |||||||||||||||||||
| Comment by Jon Hyman [ 16/Sep/13 ] | |||||||||||||||||||
|
We just started getting hit by this bug and it's causing an incredible amount of slowness in our app. Is there any way to get this done sooner than 2.5.3, or a potential workaround? | |||||||||||||||||||
| Comment by Edouard Servan-Schreiber [ 12/Jun/13 ] | |||||||||||||||||||
|
this does not seem to be resolved in 2.4.3 nor 2.4.4. | |||||||||||||||||||
| Comment by Aaron Staple [ 07/Mar/13 ] | |||||||||||||||||||
|
Hi Sergejs - I believe this has been resolved with the |