[SERVER-15086] Allow for efficient range queries over non-array fields in multikey indices Created: 28/Aug/14 Updated: 10/Nov/20 Resolved: 09/Jun/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Querying |
| Affects Version/s: | 2.4.11, 2.6.4 |
| Fix Version/s: | 3.3.8 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Ben Rotz | Assignee: | Max Hirschhorn |
| Resolution: | Done | Votes: | 16 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Major Change |
| Sprint: | Query 16 (06/24/16) |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Issue Status as of Jun 06, 2016

ISSUE SUMMARY
Support for tracking which indexed fields cause the index to be multikey (aka "path-level multikey tracking") has been implemented in MongoDB 3.3.8. The query planner uses this information to compute tighter bounds on queries by assigning additional predicates to the index.

SUPPORTED STORAGE ENGINES
This feature is not supported for the MMAPv1 storage engine. MMAPv1 support for path-level multikey information is tracked in a separate ticket.

IMPACT ON DOWNGRADE
Downgrading from MongoDB 3.3.8 to a version of MongoDB older than 3.2.7 may produce the following assertion message if indexes contain path-level multikey information:
See the linked ticket.

TECHNICAL DETAILS

Tighter bounds in 3.3.8+
Using the path-level multikey information we track for an index, we can search a bounded interval (thus having intersected the bounds) on a field in a compound multikey index that never refers to an array value containing multiple elements. In the following example, the "a" field is a scalar and the "b" field is an array value containing multiple elements.
Contrasted with 3.2 and earlier
In MongoDB 3.2 and earlier, the query planner doesn't have enough information to know that the "a" field is never an array value in any document in the collection (e.g. that the document {a: [1, 2, 3], b: 5} doesn't also exist in the collection). Without this additional information, the query planner must assume that the {a: {$gte: 0}} condition could apply to a different array element than the {a: {$lt: 10}} condition. Thus, we can only search a left-bounded or right-bounded interval on the "a" field and apply the other condition as a subsequent filter:
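The shell examples from the original ticket were not preserved in this export. As a substitute illustration, here is a small plain-JavaScript sketch (invented helper names, not MongoDB server code) of the array-matching semantics that force this behavior: each predicate on the same field may be satisfied by a different array element.

```javascript
// Sketch of MongoDB-style matching on a field that may hold an array:
// a predicate matches if ANY element (or the scalar itself) satisfies it,
// so two predicates on the same field can match DIFFERENT elements.
function matchesPredicate(value, pred) {
  const elems = Array.isArray(value) ? value : [value];
  return elems.some(pred);
}

function matchesAll(doc, field, preds) {
  // {a: {$gte: 0, $lt: 10}} is a conjunction of independent predicates on "a"
  return preds.every(pred => matchesPredicate(doc[field], pred));
}

const gte0 = v => v >= 0;
const lt10 = v => v < 10;

// Scalar: both predicates hit the same value, so bounds could be intersected.
console.log(matchesAll({a: 5}, "a", [gte0, lt10]));        // true
// Array: 20 satisfies $gte:0 and -5 satisfies $lt:10, so the document matches
// even though no single element lies in [0, 10).
console.log(matchesAll({a: [-5, 20]}, "a", [gte0, lt10])); // true
```

An index scan restricted to [0, 10) would therefore miss {a: [-5, 20]}, which is why, without path-level multikey information, the planner can only use one bound and must apply the other predicate as a filter.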
Explain output in 3.3.8+
The explain output for the COUNT_SCAN, DISTINCT_SCAN, and IXSCAN stages has been updated to include an additional "multiKeyPaths" field for indexes that support this feature. The "multiKeyPaths" field describes which prefixes of the indexed fields cause the index to be multikey:
Updated description
Suppose you have documents of the form
with the index {"a.b": 1}. With this schema, there is currently no way to constrain the index bounds with multiple predicates. Users who need to do this generally must change their schema. Instead, we could provide new query language semantics to support this use case. See the comments below or the related tickets.

Original Description
It looks like there is unexpectedly slow performance on a range query when using $elemMatch. I first encountered it when using the aggregation framework with a $match pipeline stage containing $elemMatch, and have now reduced it to an example using find(). The performance hit is quite large, 20x slower or so, when I believe, if anything, the query should actually run FASTER, not 20x slower. I could try to troubleshoot more, but I don't totally understand the explain() output when querying ranges of dates. Sometimes explain() shows the indexBounds as 'true', and sometimes as 'ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")', and I'm not 100% sure how to interpret that, other than I believe it is equivalent to "no bound" (meaning the query is not bounding itself at the top or bottom of the range properly). |
| Comments |
| Comment by Hoyt Ren [ 22/Mar/16 ] |
|
Thanks for the effort; I know this is difficult. When filtering records, it's super easy to know that a field holds a particular kind of value, but when selecting an index we must know about the whole collection, which isn't easy since MongoDB tries to provide schema freedom too. Looking forward to your success; this will be a really useful feature. |
| Comment by Asya Kamsky [ 21/Mar/16 ] |
|
Hoyt Ren, like I said above, we are working on a solution that would not require users to change their queries. As long as the query expresses correct boundaries (via $elemMatch for arrays if a single element must satisfy multiple conditions), the goal of this ticket is for the query optimizer to have enough information to pick the best index with the narrowest possible bounds. Thanks for your feedback, and please continue to watch this ticket for updates. |
| Comment by Hoyt Ren [ 21/Mar/16 ] |
|
Ok, I was too hasty. I agree that I should not assert that people won't do that, since MongoDB already allows it; the fact that I don't have such a case doesn't mean others don't. Then I guess what we could expect is a way to set a constraint that tells MongoDB whether a field is a scalar. Another option may be to introduce a new operator like $elemMatch that does the expected work and explicitly tells the planner how to use the index. Neither way may make things simple and clear, but it's enough to have a way at all. |
| Comment by Asya Kamsky [ 21/Mar/16 ] |
|
It is absolutely not the case that a:1 and a:[1] are the same. $elemMatch should and will match one of them and not the other. If we change that behavior, it will break existing code. It doesn't matter that a schema relying on this may not be the best design; besides, there may be cases where such a design is necessary and was created deliberately. |
| Comment by Hoyt Ren [ 21/Mar/16 ] |
|
Hi Asya, I know you are thinking of breaking backwards compatibility, but I believe it's not a problem, because currently no one uses {'a.b': {$elemMatch: {$gte: 1, $lte: 10}}}, as it doesn't work and is the wrong expression. I believe a scalar is a special case of an array that has only one element. When we know a field is a scalar, we will not in fact apply $elemMatch, so even if we let $elemMatch match scalars, it doesn't change any result. I mean: I don't have a use case where somebody has data like {a: 5} and {a: [1, 3, 10]} and expects only {a: [1, 3, 10]} back from {a: {$elemMatch: {$gte: 1, $lte: 10}}}. Note: sorry, I was trying to fix the formatting problem. I have now found the preview and help buttons. |
| Comment by Asya Kamsky [ 19/Mar/16 ] |
|
When $elemMatch is used over an array element, it does mean that the same element of the array has to satisfy the given conditions. However, since we currently don't know which of the fields in a compound index is (or can be) an array, we can't apply tighter bounds to the non-array field. It's true that if you could specify $elemMatch on a non-array you could "indicate" to the optimizer to do so; however, $elemMatch will only match an array element, so it won't match values that are not an array. Since that would cause a different query result, we can't change it (for more discussion, see the related ticket). We are currently working on improvements that will allow this issue to be resolved. |
| Comment by Hoyt Ren [ 19/Mar/16 ] |
|
Hi all, I read the discussion, and I believe the root problem is this: currently, when we say {$and: [{A: {$gt: 3}}, {A: {$lt: 6}}]}, we don't know whether 'name A' is a scalar or a set; in the worst case it's a set, so we can't intersect the index bounds. But when we say A: {$gt: 3, $lt: 6}, this is in fact saying A[0]: {$gt: 3, $lt: 6}, A[1]: {$gt: 3, $lt: 6}, ... Here A is the 'current element of A', not the 'name A', and the index bounds could be intersected. The MongoDB language currently doesn't capture this distinction from natural language. So I believe we should let MongoDB understand this, or have a way to tell MongoDB more explicitly whether we mean 'name A' or 'current element of A'. I agree with Derek: at least A: {$elemMatch: {$gt: 3, $lt: 6}} should work, and the index bounds should be intersected in that case. |
| Comment by David Storch [ 12/Nov/15 ] |
|
Hi Hoyt Ren, please re-open |
| Comment by Hoyt Ren [ 12/Nov/15 ] |
|
After I upgraded my production server from 3.0.5 to 3.0.7, the problem, which is described in |
| Comment by Michael Barszcz [ 21/Oct/15 ] |
|
Yes, after reading through the ticket more I have a better understanding. At first I thought this issue was specifically about range queries on the multikey field, but it's more about range queries on a multikey index. |
| Comment by David Storch [ 21/Oct/15 ] |
|
Hi mbarszcz, correct, the behavior you described is exactly the limitation which we hope to fix under this ticket. Hopefully the ticket description and previous comments provide enough context on the issue: why this is currently the expected behavior, and what we plan to do in the future to fix it. |
| Comment by Michael Barszcz [ 20/Oct/15 ] |
|
I'm assuming this behavior for a compound multikey index also falls under this issue? Say I have documents in the formats below:
With compound index:
A query like this is still unable to use the upper and lower bound on the time field, which is the second piece of the index:
This was taken from an actual system, but you can see the lower bound on the time isn't used in the index scan (sorry, dates are stored as microseconds since the Unix epoch in the real system)
I'm having a hard time understanding why this is the case. |
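(The documents, index, and explain output from this comment were not preserved in this export.) One way to see why the planner must be conservative in a case like this: for a compound index over an array of subdocuments, one index key is emitted per array element, and before path-level tracking the server recorded only a single isMultiKey boolean for the whole index. A plain-JavaScript sketch (invented helper names, and an assumed {"e.s": 1, "e.t": 1}-style index echoing the e.t example elsewhere in this ticket; not server code):

```javascript
// One index key per element of the "e" array for a compound index on
// {"e.s": 1, "e.t": 1}. Even though e.t is always a scalar here, the
// presence of the array anywhere on the path marks the WHOLE index
// as multikey.
function indexKeys(doc) {
  return doc.e.map(ev => [ev.s, ev.t]);
}

const doc = {e: [{s: "a", t: 100}, {s: "b", t: 200}]};
console.log(indexKeys(doc)); // [["a", 100], ["b", 200]]

// Pre-3.3.8 metadata: one boolean for the whole index, with no record of
// WHICH field caused it -- so the planner cannot prove that e.t is never
// an array and must leave one side of any range on it open.
const isMultiKey = doc.e.length > 1;
console.log(isMultiKey); // true
```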
| Comment by Derek Wilson [ 29/Jan/15 ] |
|
I'm not sure why this shouldn't work: db.test.find({data: {$elemMatch: {point: {$elemMatch: {$gte: 2, $lte: 2}}}}}). It seems like if, for indexing, data.point is treated as an array when it has multiple values from different subdocs (in the case where it isn't known not to be an array), then it should also be treated as an array for matching, letting me do an $elemMatch on it. This way of expressing it makes me think that the best solution is something like |
| Comment by David Storch [ 28/Jan/15 ] |
|
Hi Ben and Derek, we will keep these suggestions in mind as we plan for future releases. We understand that this is an important limitation for a common use case, and have taken note of the attention that this ticket (and sibling tickets) has received. Best, |
| Comment by Ben Rotz [ 28/Jan/15 ] |
|
I agree with the concept of having the index take additional parameters that could enforce the way documents are selected from a collection. You could also do an index transform, for example, to create a case-insensitive index (when using this index, because the keys are stored all lowercase, it works as a case-insensitive index). I hope someone at Mongo considers what Derek suggested. I would love to hear back from someone on this suggestion. |
| Comment by Derek Wilson [ 28/Jan/15 ] |
|
Hi, I have been working with a similar use case where I needed to select documents based on a bounded range of values within an array of subdocuments. This really does seem like a common use case to me and I hope it is fixed quickly. What I would really like to add to this issue is that the problem didn't pop up for me until much later than I would have expected. My document structure looked like this during testing:
My index was set on a.b, and I used queries like {a: {$elemMatch: {b: {$gt: 1, $lt: 3}}}}, which worked fine. The structure was designed to support one or more documents in "a", but mostly there would be only one, and on the occasions there were more it was few. So the way I first noticed this was with a single shard slowing down as soon as data showed up with more than one subdoc; it wasn't immediately obvious that this was the issue. It really does kill performance for these types of queries and makes them very painful.

I do understand the issue of needing to deal with the case where b holds an array of values itself and is also inside an array of subdocs. But I disagree with the idea that you should require different query semantics to handle this case, and I also don't see why you would require full schema validation as an alternative. Both of those options seem very heavyweight to me. It seems like something along the lines of

And if something like that could work, it also seems like I could/should be able to indicate this at the index level at index creation time... and I shouldn't need schema validation, because really what I'm indicating is how I expect my indexed queries to be planned, not what data to enforce. As you point out, you need to plan for the worst-case scenario by default, but if I indicate that I want to exclude handling certain worst cases, then I'm also indicating that I'm willing to accept what seems like, at worst, a subset of correct results in the case that I crap up my data. Really, if it's just a planning override at the index level, I could also see cases where you have something like
as a valid option alongside my use case, where you still only want "b"s where any single value matches all the conditions and you want to use the index. |
| Comment by Ben Rotz [ 15/Sep/14 ] |
|
Hi Dave, your summary captures the whole thing well. I had hoped that this issue would have been moved to a planned version by now. It seems that this would be a common issue, but perhaps my schema is more uncommon than I expected, and I am the only one making noise about it. In addition to the previously suggested fixes, another possibility could be a new operator "$scalarMatch", which works like "$elemMatch" only with scalars (exclusively). Personally, I can't imagine a use case in which someone would intentionally create a schema in which the same field name stores both scalars and arrays, and even if they did, then also want to perform a find/match against it that excludes scalar matches. So I personally prefer having $elemMatch match scalars as well as array elements (simpler usage, simpler documentation, etc.), instead of a new operator (which requires additional documentation, and people saying "well wtf do I need this for", etc.). |
| Comment by David Storch [ 02/Sep/14 ] |
|
Hi nefiga, Thanks for all your thoughts on this! I'll try to respond to all your points above. Please let me know if I missed anything.
Yes, I'll definitely make sure that this is on the radar. I can't make any promises about a specific timeline or fix version, but it's always good to hear from users about important use cases in order to help us improve the product.
Yes, currently the query engine only knows whether or not an index is multikey. Since we don't have schema validation, the system always assumes the worst case (i.e. assume anything might be an array). This is necessary in the current system for correctness of query results. I can imagine fixing this and related problems in a more general way if we introduce schema validation down the line.
As Asya mentioned above, something like this has already been proposed in
There are definitely a few possibilities for fixing this. Please continue to watch this ticket in order to track further developments. Best, |
| Comment by Asya Kamsky [ 01/Sep/14 ] |
|
I believe the last proposal is related to |
| Comment by Ben Rotz [ 30/Aug/14 ] |
|
Hi Dave,
sums up the issue at hand nicely. Using your idea above, let's say we had:
If we run the query
We get all records back, which is correct (although some users may have expected record 1 to be excluded, we both know it is technically correct that it is included), but our query does NOT use the index upper bound, as doing so would incorrectly filter out the 1st record. However, if we run the query
We get only records 2 and 3 back (excluding records 1 and 4). But the index is used as desired (having an upper and lower bound). I agree that record 1 should be excluded for this query, but I would argue that record 4 should have been included, as that also opens the door for doing $elemMatch in situations such as mine. Seeing it described in this way makes me agree with 6050 for inclusion. I think what we've done here is basically provide a more detailed analysis of 6050, and hopefully I've shown why it is not only a good idea, but necessary to get indexing to work for the common scenario of indexing an array of objects with a timestamp (or some other value that would be used in a "range", but timestamp seems very common). I would put this forth as a possible solution; I find it to be an urgent fix, and I send the strongest desires possible over the internet tubes to you to get this set for a fix in 2.7/2.8. I also realized the problem comes about whenever an array at ANY level gets used in the index. At first I was confused about the difference between this and just doing a regular range query on a simple index on a collection without any arrays or subdocuments. From what I can tell, when an index is built in Mongo, if any of the index fields is an array, the value of isMultiKey is set to true. This means for an index like {'e.t': 1}, if either "e" or "t" has arrays of size > 1, isMultiKey gets set to true. That can be seen in the following example:
The only reason I illustrate the above example was to draw a POSSIBLE second solution to the problem, and that is to change "isMultiKey" to be an array of values to indicate which levels of the index are multikey, instead of just a simple boolean. That is to say, for the data
Instead of having
(which I believe is what is telling the query engine to use a "true" for the lower bound of the e.t index for a range query), you could possibly have
which I can imagine would address this issue, since we could see that the second field of the index ("t") does not utilize an array of size > 1 (i.e. is a scalar). There could perhaps be other ways to store this information, but I believe the example isMultiKey syntax above would store all the necessary information to know that indexes on "e" are multikey, but indexes on the "t" portion of "e" (as in e.t) are not. So, in short, it seems there are two possible fixes here:
#1 seems like the easiest fix. But is it the best? Part of me thinks that the better answer is #2, but I suppose if someone ever had a situation in which the "t" in the "e.t" example was sometimes an array and sometimes a scalar (I personally would never model things this way, but I digress), then they would want #1, since having #2 would still pose a potential index problem when doing a range. Good thing there are geniuses who work at 10gen who can help find the best answer; maybe the best answer is actually to do both suggestions. If there is anything I can do to get this on someone's radar to address sooner rather than later, please let's do it! I'm not a billionaire (sorry!), but I'd gladly order the whole office pizza and drinks; just email me the office address of where to send it! Thank you!!! |
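Suggestion #2 above is essentially the idea MongoDB 3.3.8 later shipped as path-level multikey tracking. A plain-JavaScript sketch of the idea (invented helper names, not server code): record, per indexed path component, whether any document stores an array of size > 1 there.

```javascript
// Resolve a dotted path like "e.t" against a document, descending through
// arrays, returning every value found at the end of each step.
function valuesAt(doc, path) {
  return path.split(".").reduce((vals, field) => {
    const next = [];
    for (const v of vals) {
      if (Array.isArray(v)) {
        for (const elem of v) next.push(elem[field]);
      } else if (v != null) {
        next.push(v[field]);
      }
    }
    return next;
  }, [doc]);
}

// For each path component, report whether it is multikey for this document
// (i.e. holds an array of size > 1).
function multikeyComponents(doc, paths) {
  return paths.map(p =>
    valuesAt(doc, p).some(v => Array.isArray(v) && v.length > 1));
}

const doc = {e: [{t: 100}, {t: 200}]};
// "e" is multikey, but "e.t" itself never holds an array:
console.log(multikeyComponents(doc, ["e", "e.t"])); // [true, false]
```

With per-path information like [true, false], a planner can safely intersect range bounds on "e.t" while remaining conservative about "e".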
| Comment by David Storch [ 29/Aug/14 ] |
|
Hi nefiga, Thanks for the detailed bug report. As I think you already discovered in your investigations, this is expected behavior. You are right that you shouldn't have to bend over backwards in order to get efficient range queries over nested documents inside an array. However, our query language doesn't yet have a good mechanism for this. Therefore, I'm going to keep this ticket open as a feature request for our query language. Here's a quick explanation of why both predicates are not used to constrain the index scan in the example above. Suppose you have a collection with index {"a.b": 1}. Every document in the collection has an array "a" containing nested documents with a field "b":
When you query on a particular range of "b" values, any indexed plan created by the query planner must not use both predicates to constrain the index scan:
This is true even if you use an $elemMatch on "a":
Suppose for argument's sake that the query planner generated a plan which scanned the doubly-open interval (3, 6) for the index {"a.b": 1}. Why is this wrong? Well suppose that we insert a document for which "a.b" is the array [2, 7]:
This document matches the query: "a" is indeed an array which has a single nested document matching both the conditions that b>3 and b<6. However, if we were to scan the index only for keys on the interval (3, 6), we would miss this document. This, of course, is only a problem if "a.b" is an array. Users who maintain the invariant that "a.b" is not an array should have a way of specifying a more efficient query. I am converting this ticket into a request for such a mechanism. Please feel free to reach out with any further questions. Best, |
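The missed-document scenario described above can be checked concretely with a few lines of plain JavaScript (a simulation with invented names, not server code):

```javascript
// Document where the path "a.b" resolves to the multikey values [2, 7]
const arrDoc = {a: [{b: 2}, {b: 7}]};

// Index keys for {"a.b": 1}: one key per array element
const keys = arrDoc.a.map(sub => sub.b); // [2, 7]

// Any-element semantics: {"a.b": {$gt: 3, $lt: 6}} matches this document,
// because b=7 satisfies $gt:3 and b=2 satisfies $lt:6.
const matchesQuery = keys.some(b => b > 3) && keys.some(b => b < 6);
console.log(matchesQuery); // true

// But an index scan over only the doubly-open interval (3, 6) finds no key
// for this document, so the document would be missed.
const foundByIntervalScan = keys.some(b => b > 3 && b < 6);
console.log(foundByIntervalScan); // false
```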
| Comment by Ben Rotz [ 29/Aug/14 ] |
|
After further discovery, it looks like https://jira.mongodb.org/browse/SERVER-6050 is indeed the answer to this problem, even though it sounds wrong initially. I find this fairly ridiculous, and it is not at all clear or obvious by any stretch. The only workaround I can come up with is to take the ticket-mentioned example and make e.t an array containing a single time value, instead of just a scalar time value. That is,
instead of
Then the query
seems to be both accurate and use the index. However, since I'm actually doing several conditions in the elemMatch, I have to take it one step further and come up with a more complex query. Given the test data:
With index
The query to accurately get records with events between 2014-08-20 12 and 2014-08-20 13 that also have a value of 1,2,3 or 4 (excluding 0 and 5), becomes:
I follow the logic behind all of this, and would agree with it, except for the portion that requires me to put "t": ISODate() into an array to get the index to work properly just because I happen to be using a range. That part is lame and needs to be fixed ASAP. I very much look forward to any feedback or suggestions on this issue, so I know whether it is a priority/concern to anyone else. From what I gather, a fix along the lines of 6050 would have my existing schema working properly, with the query modification being that apparently the inner $elemMatch is required to get the range to properly use the index. This is a little confusing to me, but hopefully this conversation doesn't die, as I'd like to get a solution in place here. Thanks!! |
| Comment by Ben Rotz [ 28/Aug/14 ] |
|
I'm just starting to realize how bad this actually is. If I understand correctly, there is actually NO WAY to effectively use an index to create a bounded range query around a scalar value that resides in an array of objects. In other words, if I have a structure like
*1,000,000 more records There is actually NO WAY to say "get me the records with e.t between 2014-08-13T17:30:00Z and 2014-08-13T18:30:00Z" without doing basically a mostly-full table scan. Is that correct? I mean, at best I pick the datetime closest to either "edge" of my data, and hopefully I'm near an edge (the beginning of time for my data, or the end of time), but if I pick an arbitrary date right in the middle of my data, I'm hosed? There's no way to have an index that would work in that situation? |
| Comment by Ben Rotz [ 28/Aug/14 ] |
|
I see this is somewhat of a rehash of this issue: https://jira.mongodb.org/browse/SERVER-8610, which points to https://jira.mongodb.org/browse/SERVER-6050. I think that the recommendation in https://jira.mongodb.org/browse/SERVER-6050 is halfway broken: $elemMatch should NOT match flat data; that makes no sense, and I cannot see the reasoning behind it. Basically I agree with the comment by Eduardo Manso in issue 6050. What I can understand is applying a range upper and lower bound to a scalar, which it seems like the OP is trying to do, and which is what I am trying to do as well. This is a major headache for me, as it means the main queries in my system are all running much slower than if the range just worked properly as expected. I urgently ask for a prioritization of this issue. If there is any assistance or debugging I can do to push this along, please let me know. Also let me know if there are any acceptable workarounds which do not require manipulation of source data. I understand you can use "hints" to force the index, but currently there is no way to force hints in the aggregation framework, so it isn't a full workaround. Also related to https://jira.mongodb.org/browse/SERVER-7959 |