[SERVER-3918] make sparse indexes error out on {$exists: false} queries Created: 21/Sep/11 Updated: 28/Oct/15 Resolved: 27/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 2.0.0 |
| Fix Version/s: | 2.1.2 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Ben Spencer | Assignee: | Aaron Staple |
| Resolution: | Done | Votes: | 12 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Major Change | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
When querying for documents in which a field does not exist, the optimiser will consider using a sparse index on that field, but this means that it will never find anything. > db.test.insert({}); , {sparse: true, safe: true}); } |
| Comments |
| Comment by Aaron Staple [ 28/Jun/12 ] |
|
Note that the currently implemented behavior (described in the existsa.js test) is to disallow a sparse index even if the field with the $exists:false predicate is not part of the index. This behavior can be overridden by using a query hint to force use of a sparse index. |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by auto [ 25/May/12 ] |
|
Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}Message: |
| Comment by Thilo Planz [ 01/Feb/12 ] |
|
>" Sparse indexes by design change results (...) To put the other results after that would require doing a full table scan, which is not what people want nor expect." I strongly disagree with that. The presence or absence of an index should only change query performance, not the results. If a sparse index cannot be used properly for $exists, then it should not be used at all, even if that results in a full table scan. What should be documented for sparse indexes is that they cannot be used in these situations (not that they can be used but return different results, making the behaviour unpredictable if you cannot know what index will be used). For the cases where the current behaviour is acceptable or preferred, there should be a separate query operator (something like $exists_sparse) or it would need to be explicitly enabled by a query hint. |
| Comment by Remon van Vliet [ 31/Jan/12 ] |
|
I think Tim's suggestion should definitely be looked at. It's true that for a lot of people the move to sparse indexes are motivated by the requirement that null values are not part of a unique constraint. Fully agree with Paul's comment. |
| Comment by Michael Schwarzberger [ 27/Jan/12 ] |
|
This ticket should be updated to Major Priority. |
| Comment by Michael Schwarzberger [ 26/Jan/12 ] |
|
I totally agree with Paul Canavese that sparse indexes should be used only on positive existence queries. When using a sparse index on a negative existence query the performance is excellent (it finishes in 0ms), but it will show a completely inaccurate result (0 matches). In fact I can't think of any situation where a negativ existence query would return a non-zero result when used with a sparse index. No sensible developer would expect a performance boost here, as the query couldn't be answered from the sparse index in any case. Using $natural:1 is also no good solution, as that would exclude any other indexes from being used. |
| Comment by Paul Canavese [ 13/Jan/12 ] |
|
The other main use case is just to allow efficient lookups on a field that is sometimes null, without having an index bloated with entries for all the null cases. For example, consider a collection with 5 million documents and a "keyword" string field that is assigned 10% of the time. The main use case is looking up documents based on a specific keyword. So I don't want to eat up my memory with 4.5 million null keyword index entries. But there may be secondary types of queries that then have unexpected behavior. Say you have an admin page to show the most recent documents without keyword being set (= null). If the keyword index gets used, you'll get nothing back. If an index on a created_at field is used, you'll get matches back. So with the current model you would need to understand that and hint to make sure you use the index that will return results. For this query, there is no case in which it makes sense to use the keywords index, because it will always return nothing. In terms of selection/filtering, I suppose I could come up with use cases where the current model would be desirable, but they seem much more unusual to me. If I wanted to query for a document with a keyword that is set but matching one or more specific values? So I think for selection I would want sparse indexes (by default at least) only to be used when matching values (or for positive existence), but not for excluding values (or for negative existence). I believe that would always make the results consistent, whether an index is used or not. (Now, I hear Eliot when he's referring to the sorting use case, which makes sense. I'm not entirely sure sure how to address that, although my inclination would still be to require hinting an index if using it would result in different behavior than without the index.) |
| Comment by Tim Olsen [ 13/Jan/12 ] |
|
I only require sparse indexes when I need a unique constraint on a field that can possibly be null (I want the unique constraint to not apply to the nulls). Unfortunately, the sparse index leads to the weird behavior which is the subject of this bug report. Do other users feel like they're being forced into using a sparse index (and the weird behavior that comes with) in this case? Maybe the solution is to allow unique constraints on full indexes to ignore nulls. |
| Comment by Eliot Horowitz (Inactive) [ 13/Jan/12 ] |
|
Sparse indexes by design change results, but we can make it a little easier to work with. For example, if you sort on a sparse field, then you only get results with that field. To put the other results after that would require doing a full table scan, which is not what people want nor expect. |
| Comment by Harry Mexxian [ 13/Jan/12 ] |
|
Perhaps the correct solution is to make the default behavior to NOT use any incomplete indices. Since these are optimization tools, which require some level of database<->app-layer cooperation (at least at the design stage), it's not unreasonable to ask the clients to include a "allow incomplete index usage" flag on a per-query basis. If you want to get real fancy, you could assign a 'consistency/completeness' level to each non-complete index you create. You can then refer to which minimum level is acceptable on a per-query basis. db.test.ensureIndex( {foo: 1}, {sparse: true, safe: true, completeness: 3}db.test.find({foo: {$exists: false}}, {completeness: 2}) ) <--- would not use sparse index This would work really well for multi-level archiving using filtered indexes. Last hour's data could be on a level 1 index, last day level 2 and so forth... all accessible with just the change of a number on the client side. |
| Comment by Paul Canavese [ 12/Jan/12 ] |
|
I agree this is very confusing, and it causes even more confusing side effects with the query optimizer (see https://jira.mongodb.org/browse/SERVER-4665). We had the unfortunate experience of having old working queries in our app break when we later added an index. I vote for indexes only affecting query performance, not behavior. |
| Comment by Fredrik Björk [ 12/Jan/12 ] |
|
Agree, this is not expected behavior. |
| Comment by Neil Sanchala [ 22/Dec/11 ] |
|
I agree with Remon here – the presence of an index should not change the behavior of a query, only its performance. Can http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24exists and http://www.mongodb.org/display/DOCS/Indexes#Indexes-SparseIndexes be updated to reflect this design? |
| Comment by Andrew Friedrich [ 28/Nov/11 ] |
|
I ran into this same issue. While it may be documented, it still caused some confusion for me. It does seem odd that queries return different results based the presence of an index. I would prefer a slow query with correct results over a fast query with incorrect results, regardless of my poor index creation. My solution was to simply change the index to a dense index. |
| Comment by Eliot Horowitz (Inactive) [ 20/Nov/11 ] |
|
Its documented at the sparse index docs. Sparse indexes are meant to have different results in certain cases, so not sure where this case falls. |
| Comment by Eliot Horowitz (Inactive) [ 20/Nov/11 ] |
|
Its documented at the sparse index docs. Sparse indexes are meant to have different results in certain cases, so not sure where this case falls. |
| Comment by Remon van Vliet [ 18/Nov/11 ] |
|
I think it would be really good to prioritize query behaviour consistency here and guarantee that indexes will only ever affect performance. I'm aware this might introduce some implementation difficulty and/or reduced performance but that's almost always a smaller issue than "unpredictable" results. For now, I would suggest adding a note to the documentation at both the sparse index documentation as well as $exists. I'd do it but I do not seem to have access. |
| Comment by Eliot Horowitz (Inactive) [ 23/Sep/11 ] |
|
sparse indexes are the only index type which changes results. |
| Comment by Remon van Vliet [ 23/Sep/11 ] |
|
How can this possibly be expected behaviour? Query semantics should be so that col.count() = col.count({f:{$exists:false}}) + col.count({f:{$exists:true}}). This surely can't even be up for debate regardless of the index type. If the query plan cannot use an index to produce the correct results, don't use the index. It's up to the user to make queries that hit indexes but it is not up to the user to have to guesstimate which query produces correct results, especially if this requires more than intermediate knowledge of how the indexes are implemented and which indexes the query optimizer will use for specific queries. |
| Comment by Scott Hernandez (Inactive) [ 22/Sep/11 ] |
|
Yes, when $exists didn't use an index that particular query pattern didn't produce these results, but as you mention, null comparisons still did. This will be even more complicated once filtered indexes are implemented. At that point I'm not sure there can be useful logic for how to make this decision and when special indexes should be used/avoided. It seems like the responsibility must be left to the user to craft queries in ways which make logical sense with the indexes they have defined. SERVER-785 |
| Comment by Ben Spencer [ 22/Sep/11 ] |
|
Note that this a regression from 1.8 (because 1.8 didn't use indexes on $exists queries at all?). Surely the query plan that happens to be chosen shouldn't affect the semantics of the query? In the above example it will always use the sparse index, but in a more complex situation the index chosen could depend on the data in the collection. A similar effect occurs when querying for {foo: null}. If the sparse index is used, it will only return docs with foo explicitly set to null; if not, it will also return docs without foo set (this behaviour is the same in 1.8 however). |
| Comment by Eliot Horowitz (Inactive) [ 22/Sep/11 ] |
|
That is expected. |