[SERVER-4180] Allow multiple constraints (on same field) within a $elemMatch clause to be used with a multikey index Created: 31/Oct/11 Updated: 11/Jul/16 Resolved: 13/Jun/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Querying |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | 2.1.2 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mathieu Poumeyrol | Assignee: | Aaron Staple |
| Resolution: | Done | Votes: | 2 |
| Labels: | regression | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
linux ubuntu |
||
| Attachments: |
|
||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||
| Description |
|
While bumping from 1.8 to mongo 2.0.1, we noticed some query optimization seemed not to work anymore. After some laborious digging in our data, we found out a single document showing the issue. Please open the attached .tgz. The tc.sh script will restore the document in a core.picture_set_pictures collection on a localhost:27017 mongo, setup an index, and explain a range query on a date embedded in a array of objects. The relevant section of the output shows. "indexBounds" : { "votes.created_at" : [ [ ISODate("2010-10-31T13:37:20Z"), ISODate("292278995-01--2147483647T07:12:56.808Z") ] ] }The upper bound is completely aberrant, and as far as I can tell, nothing in the document justifies such a value. As a consequence, on the real database, this will scan half of the table for every request. On the contrary, 1.8.3 was showing the right upper bound. Please also note that as far as we can tell, our collections have other instances of documents raising the problem (that is, a collection with just this document will exhibit the issue, but just removing the doc from the collection is not enough for the issue to disappear). |
| Comments |
| Comment by Mathieu Poumeyrol [ 13/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
This collection is big, and there are actually several ones impacted. I asked formally what was going to happen. Once I understood what was happenning, I gave you a testable case, and I POCed a patch. You guys told me there would be a fix for 2.2. Now, height months later, I'm stuck with a choice between a "maybe" and a huge and complicated data migration on several collections, including the biggest and most critical of my dataset. Please schedule 6050 for 2.2. I understand the rush for a 2.2 release... but come on. In the current state, seeing this qualified as "fixed" feels like a very bad joke on me. | ||||||||||||||||||||||||||||||||||||||
| Comment by Aaron Staple [ 13/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
Hi Mathieu - I suggest you change your schema slightly and add an elemMatch to your original query, per the following example. If
| ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 13/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
I'm sorry to be a pain here, but this fix is completely irrelevant to the issue. Worse, the implementation of the support for scalar values in $elemMatch being the root cause of the apparition of the regression in the first time, I fear these changes have brought us even farther from an actual resolution. I insist: this is a sneaky but major regression (a few other people have mentioned difficulties with the same problem on the irc channel) from 1.8. I sincerely hope we can have it fixed for 2.2. In any case, we definitely need to reopen it. | ||||||||||||||||||||||||||||||||||||||
| Comment by Aaron Staple [ 13/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
We've implemented an optimization allowing the $elemMatch operator to utilize multiple bounds on a multikey index. So if you do a query { a:{ $elemMatch: { $gte:0, $lte:10 }} } on index { a:1 }, only the a-values 0 through 10 will be scanned. This optimization doesn't apply to nested $elemMatch queries, for example { a:{ $elemMatch:{ b: { $gte:0, $lte:10 }} } on index { 'a.b':1 }. Since the $elemMatch operator is applied to a and not 'a.b', only one bound constraint can be applied to the values of 'a.b', and in this case all 'a.b' values >= 0 would be scanned. In the above situation you might consider a query like { 'a.b':{ $elemMatch: { $gte:0, $lte:10 } } } instead. Keep in mind though that (pending } } would match the documents { _id:0, a: { b:[ 5 ] }} and { _id:1, a:[ { b:[ 5 ] }] } but not { _id:2, a:[ { b:5 }] }. | ||||||||||||||||||||||||||||||||||||||
| Comment by auto [ 09/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: | ||||||||||||||||||||||||||||||||||||||
| Comment by auto [ 09/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: | ||||||||||||||||||||||||||||||||||||||
| Comment by auto [ 09/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: | ||||||||||||||||||||||||||||||||||||||
| Comment by auto [ 09/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: | ||||||||||||||||||||||||||||||||||||||
| Comment by auto [ 09/Jun/12 ] | ||||||||||||||||||||||||||||||||||||||
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: | ||||||||||||||||||||||||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 28/Mar/12 ] | ||||||||||||||||||||||||||||||||||||||
|
This is scheduled for 2.1.2, so we'll be working on it a couple of weeks. | ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 11/Nov/11 ] | ||||||||||||||||||||||||||||||||||||||
|
Anybody listening here ? | ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 03/Nov/11 ] | ||||||||||||||||||||||||||||||||||||||
|
In the end, I went for strategy #3. The results are https://github.com/kali/mongo/commit/a559b7132ae5d595e0d63bb7a89231ab9bacb2d8 . It is a bigger patch than I expected, and it will be a pain to rebase to other versions, but if I'm pretty sure I'll get it to work on 2.0 and get us out of trouble for now. Eliot, guys, I would like to know if something of the sort has a remote chance to be integrated. I would assume not before 2.1 anyway, but in that case I would spend a few more hours tidying every loose end I can see or sensible changes you want made. I do think there is value in making the hint system extensible, but the way I chose may not be of your liking. If you don't like the idea at all, please let me know what are the plans (or even lack of plan) to get a fix (or a won't fix) for the range vs multivalued optimization issue. A rough idea of the version it may land would be of great help. Apologies for being a pain, but there is a huge performance regression here, and for us at least, a definite show-stopper for features that have taken already months to implement based on assumptions made on 1.8 behavior. | ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 02/Nov/11 ] | ||||||||||||||||||||||||||||||||||||||
|
giving a shot at a patch for strategy #2. | ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 02/Nov/11 ] | ||||||||||||||||||||||||||||||||||||||
|
My understanding of the issue is the optimizer refuses to consider the query a range query because it can not know that the inner key, "a", in the previous example, is never multivalued. So I was looking for ways to convey this information to the optimizer: }}}}' | ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 01/Nov/11 ] | ||||||||||||||||||||||||||||||||||||||
|
So, with the JS display of the maximum date value set aside, a much simpler testcase is: db.test.drop() , { a: 4}, { a:8} ] }) ) It was passing in 1.8.2 (haven't test 1.8.3) and is breaking in 2.0.1. | ||||||||||||||||||||||||||||||||||||||
| Comment by Mathieu Poumeyrol [ 31/Oct/11 ] | ||||||||||||||||||||||||||||||||||||||
|
OK, I understand the possible semantic ambiguity with multivalued field. But are you sure 2.0 choice is better than the 1.8 ? As a matter of fact, what my query means is "some actual value in the range" not "any absurd combination of unrelated values managing to somehow match the range as a side effect"... The actual query I started investigating with was using $elemMatch (as a third filtering criteria was applied to the a potential matching "vote"). In that case, the ambiguity of the query $lt/$gt over a subdocument was gone. If I understand correctly, the 1.8 behavior is not reproducible with 2.0. That is a big regression as far as we are concerned. Optimization of these queries are a critical issue for us, meaning we will be stuck with 1.8 as long as these range semantics are not somehow restored. Our application can't afford the supplementary data scan the new semantics implies. | ||||||||||||||||||||||||||||||||||||||
| Comment by Aaron Staple [ 31/Oct/11 ] | ||||||||||||||||||||||||||||||||||||||
|
Because your field is multikey we can only use one index bound for the index on this field, not both an upper and lower bound. Please see |