[SERVER-15086] Allow for efficient range queries over non-array fields in multikey indices Created: 28/Aug/14  Updated: 10/Nov/20  Resolved: 09/Jun/16

Status: Closed
Project: Core Server
Component/s: Performance, Querying
Affects Version/s: 2.4.11, 2.6.4
Fix Version/s: 3.3.8

Type: Improvement Priority: Major - P3
Reporter: Ben Rotz Assignee: Max Hirschhorn
Resolution: Done Votes: 16
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-8610 $elemMatch (objects in array) not usi... Closed
is duplicated by SERVER-7959 Potentially unexpected scans with com... Closed
is duplicated by SERVER-10436 wrong index ranges when using compoun... Closed
is duplicated by SERVER-14618 Wrong index bounds when using "hint" Closed
is duplicated by SERVER-15059 index on date range on embedded colle... Closed
is duplicated by SERVER-20302 Strange execution plan for range quer... Closed
is duplicated by SERVER-23038 Why index dosn't working? Closed
is duplicated by SERVER-23079 Why sub document cannot match index? Closed
is duplicated by SERVER-24786 Compound multikey index does not comp... Closed
is duplicated by SERVER-26978 Query Optimizer not using index corre... Closed
is duplicated by SERVER-27386 Poor use of index Closed
is duplicated by SERVER-28010 Creating index on array of sub docume... Closed
is duplicated by SERVER-4468 Keep track of multi-key fields in index Closed
is duplicated by SERVER-6050 Consider allowing $elemMatch applied ... Closed
is duplicated by SERVER-14944 Use of Indexes for Query Closed
Related
related to SERVER-31444 Queries against multikey trailing fie... Closed
related to SERVER-17064 indexBounds error Closed
related to SERVER-18115 The planner can add an unnecessary in... Closed
related to SERVER-21963 Index range scan inaccurate when comp... Closed
related to SERVER-4463 Allow covered index use for non-array... Closed
related to SERVER-4468 Keep track of multi-key fields in index Closed
is related to SERVER-21178 Super slow query and increased memory... Closed
is related to DOCS-6233 We should clarify that range queries ... Closed
is related to SERVER-22727 Extend the MMAP index catalog to stor... Closed
is related to SERVER-7566 simpler way to figure out if an index... Backlog
is related to SERVER-3173 Planner should use path-level multike... Closed
Backwards Compatibility: Major Change
Sprint: Query 16 (06/24/16)

Description
Issue Status as of Jun 06, 2016

ISSUE SUMMARY
In versions of MongoDB 3.2 and earlier, the metadata associated with an index contains a single bit to indicate whether or not the index is multikey. There is no granular information to indicate which indexed fields cause the index to be multikey. This has implications for compound multikey indexes because it prevents the query planner from using the index as efficiently as theoretically possible (i.e. given perfect knowledge about the structure of the documents in the index).

Support for tracking which indexed fields cause the index to be multikey (aka "path-level multikey tracking") has been implemented in MongoDB 3.3.8. The query planner then uses this information to get tighter bounds on queries by assigning additional predicates to the index.

SUPPORTED STORAGE ENGINES
This feature is supported for the WiredTiger and InMemory storage engines only.

This feature is not supported for the MMAPv1 storage engine. MMAPv1 support for path-level multikey information is tracked in SERVER-22727.

IMPACT ON DOWNGRADE
It is possible to downgrade from MongoDB 3.3.8 to MongoDB 3.2 without first dropping all of the indexes that were created on MongoDB 3.3.8; however, it is only possible to downgrade to MongoDB 3.2.7 or newer. Downgrading to MongoDB 3.2.7 automatically deletes the path-level multikey information from the index metadata. The indexes retain only the index-level flag that indicates whether the index is multikey.

Downgrading from MongoDB 3.3.8 to a version of MongoDB older than 3.2.7 may produce the following assertion message if indexes contain path-level multikey information:

2016-05-31T13:18:11.090-0400 I -        [initandlisten] Assertion: 13111:wrong type for field (ns) 10 != 2

See SERVER-23117 for more information on this downgrade requirement.

TECHNICAL DETAILS

Tighter bounds in 3.3.8+

Using the path-level multikey information tracked for an index, we can search a bounded interval (i.e. intersect the bounds) on a field of a compound multikey index that never refers to an array value containing multiple elements. In the following example, the "a" field is a scalar and the "b" field is an array value containing multiple elements.

Index: {a: 1, b: 1}
Document: {a: 5, b: [1, 2, 3]}
⇒ Multikey paths: ["b"]
Query: {a: {$gte: 0, $lt: 10}}
⇒ Bounds on "a": [0, 10)
⇒ Bounds on "b": [MinKey, MaxKey]

SERVER-22401 has additional examples of what bounds can be generated for compound multikey indexes given a document structure and query.
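
As a minimal, runnable sketch of the example above (assuming MongoDB 3.3.8+ with the WiredTiger storage engine; the collection name and data are hypothetical):

db.test.drop();
db.test.insert({a: 5, b: [1, 2, 3]});
db.test.createIndex({a: 1, b: 1});
// Only "b" makes the index multikey, so the two predicates on "a" can be
// intersected; explain() should report bounds of [0, 10) on "a" rather than
// a one-sided interval plus a residual filter.
db.test.find({a: {$gte: 0, $lt: 10}}).explain();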

Contrasted with 3.2 and earlier

In MongoDB 3.2 and earlier, the query planner doesn't have enough information to know that the "a" field is never an array value in any document in the collection (e.g. that the document {a: [1, 2, 3], b: 5} doesn't also exist in the collection). Without this additional information, the query planner must assume that the {a: {$gte: 0}} condition could apply to a different array element than the {a: {$lt: 10}} condition. Thus, we can only search a left-bounded or right-bounded interval on the "a" field and must apply the other condition as a subsequent filter:

{
    "stage" : "FETCH",
    "filter" : {
        "a" : {
            "$gte" : 0
        }
    },
    "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
            "a" : 1,
            "b" : 1
        },
        "indexName" : "a_1_b_1",
        "isMultiKey" : true,
        ...
        "indexBounds" : {
            "a" : [
                "[-inf.0, 10.0)"
            ],
            "b" : [
                "[MinKey, MaxKey]"
            ]
        }
    }
}
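
To see why this conservatism is necessary when the planner lacks path-level information, consider a sketch (hypothetical data, continuing the example above) of a document that intersected bounds on "a" would miss:

// If "a" may itself be an array, a document can match the query without any
// single index key for "a" falling inside [0, 10):
db.test.insert({a: [-5, 15], b: 1});
// {$gte: 0} is satisfied by the element 15 and {$lt: 10} by the element -5,
// so the document matches, yet its index keys for "a" (-5 and 15) both lie
// outside [0, 10). Scanning only the intersected interval would miss it.
db.test.find({a: {$gte: 0, $lt: 10}});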

Explain output in 3.3.8+

The explain output for the COUNT_SCAN, DISTINCT_SCAN, and IXSCAN stages has been updated to include an additional "multiKeyPaths" field for indexes that support this feature. The "multiKeyPaths" field describes which prefixes of the indexed fields cause the index to be multikey:

{
    "stage" : "IXSCAN",
    "keyPattern" : {
        "a" : 1,
        "b" : 1
    },
    "indexName" : "a_1_b_1",
    "isMultiKey" : true,
    "multiKeyPaths" : {
        "a" : [ ],
        "b" : [
            "b"
        ]
    },
    ...
}

SERVER-23115 has additional information regarding the "multiKeyPaths" field in the explain output.

Updated description

Suppose you have documents of the form

{a: [{b: 1}, {b: 2}, {b: 3}]}
{a: [{b: 4}, {b: 5}, {b: 6}]}

with the index {"a.b": 1}. With this schema, there is currently no way to constrain the index bounds with multiple predicates. Users who need to do this generally must change their schema. Instead we could provide new query language semantics to support this use case. See the comments below or related tickets SERVER-8610, SERVER-7959, and SERVER-6050 for more details.

Original Description

It looks like there is unexpectedly slow performance on a range query when using $elemMatch. I first encountered it when using the aggregation framework, using the $match pipeline stage with $elemMatch, and have now backed it out into an example with find().

The performance hit is quite large, 20x slower or so, when I believe, if anything, the query should actually run FASTER, not 20x slower.

I could try to troubleshoot more, but I don't totally understand the explain() output when doing ranges of dates. Sometimes explain() puts the indexBounds as 'true', and sometimes as 'ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")', and I'm not 100% sure how to interpret that, other than I believe the equivalent is "no bound" (meaning the query is not bounding itself at the top or bottom of the range properly).



Comments
Comment by Hoyt Ren [ 22/Mar/16 ]

Thanks for the efforts. I know it's difficult: when filtering records, it's super easy to know that a field is something, but when selecting an index, we must know the whole collection, and that isn't easy, as MongoDB is trying to provide freedom too. Hoping for your success; this will be a really useful feature.

Comment by Asya Kamsky [ 21/Mar/16 ]

Hoyt Ren, like I said above, we are working on a solution that would not require users to change their queries. As long as the query expresses the correct boundaries (via $elemMatch for arrays, if a single element must satisfy multiple conditions), the goal of this ticket is for the query optimizer to have enough information to pick the best index with the narrowest possible bounds.
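
For example (hypothetical collection and field names), expressing the boundaries so that a single array element must satisfy both conditions:

// Without $elemMatch, the two conditions may be satisfied by different array
// elements; with it, one element must satisfy both, which is what lets the
// optimizer use the tight interval [80, 90) once it has enough information.
db.results.find({scores: {$elemMatch: {$gte: 80, $lt: 90}}});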

Thanks for your feedback and please continue to watch this ticket for updates.

Comment by Hoyt Ren [ 21/Mar/16 ]

OK, I was too arbitrary. I agree that I should not assert that people won't do that, since MongoDB already allows it. Just because I don't have such a case doesn't mean others don't.

Then I guess what we could expect is a way to set a constraint that tells MongoDB whether a field is a scalar or not.

Another option may be to introduce a new operator, like $elemMatch, that does the expected work and explicitly tells MongoDB how to use the index.

However, both ways may not make things simple and clear; it's enough that there is a way at all.

Comment by Asya Kamsky [ 21/Mar/16 ]

It is absolutely not the case that a:1 and a:[1] are the same. $elemMatch should and will match one of them and not the other.

If we change that behavior, it will break existing code. It doesn't matter that a schema relying on this may not be the best design - plus there may be cases where such a design is necessary and was created deliberately.

Comment by Hoyt Ren [ 21/Mar/16 ]

Hi Asya,

In SERVER-6050, I agree with Eduardo that Aaron made a mistake and wrote the wrong sentences; it's not MongoDB's fault.

I know you are thinking of backwards compatibility. I believe it's not a problem, because currently no one uses {'a.b':{$elemMatch:{$gte:1,$lte:10}}}, since it doesn't work and is the wrong expression.

I believe a scalar is a special case of an array that has only one element. When we know a field is a scalar, we won't actually apply $elemMatch, so even if we let $elemMatch match scalars, it doesn't change any result.

I mean:
When A is a scalar, {A:{$gte:1,$lte:10}} and {A:{$elemMatch:{$gte:1,$lte:10}}} should be the same, and no one would have artificially added the '$elemMatch'.
When A is an array, {A:{$gte:1,$lte:10}} works as it does currently, but {A:{$elemMatch:{$gte:1,$lte:10}}} could work as we expect.
So I don't think there is a backwards compatibility problem here.

I don't have a use case where somebody has data like:

{a:5} {a:[1,3,10]}

and expects to get back only

{a:[1,3,10]}

by using {A:{$elemMatch:{$gte:1,$lte:10}}}.
I don't understand why someone would design a data structure mixing scalars and arrays and hope this affects the logic. I do believe this is a bad idea.
So, if {A:{$elemMatch:{$gte:1,$lte:10}}} returned both documents, everything would be correct.

Note: Sorry, I was trying to fix the formatting problem. I've now found the review and help buttons.

Comment by Asya Kamsky [ 19/Mar/16 ]

Hoyt Ren

When $elemMatch is used on an array field, it does mean that the same element of the array has to satisfy the given conditions. However, since we currently don't know which of the fields in a compound index is (or can be) an array, we can't apply tighter bounds to the non-array field. It's true that if you could specify $elemMatch on a non-array you could "indicate" to the optimizer to do so; however, $elemMatch only matches an array element, so it won't match values that are not arrays. Since that would cause a different query result, we can't change it (for more discussion, see SERVER-6050, which proposed allowing $elemMatch to match non-array elements).
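
A quick illustration of that matching behavior (hypothetical collection and data):

db.demo.drop();
db.demo.insert({a: 5});    // scalar
db.demo.insert({a: [5]});  // single-element array
// $elemMatch matches only array values: this query returns the second
// document but not the first, which is why changing the semantics would
// break existing queries.
db.demo.find({a: {$elemMatch: {$gte: 1, $lte: 10}}});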

We are currently working on improvements that will allow this issue to be resolved.

Comment by Hoyt Ren [ 19/Mar/16 ]

Hi All,

I read SERVER-6720, where David explained why the index can't be intersected.

I believe the root problem is that, currently,
$and:[{A: {$gt: 3}}, {A: {$lt: 6}}]
and
A:{$gt: 3, $lt: 6}
are the same thing in the MongoDB language, but they may not be in people's minds.

When we say {$and:[{A: {$gt: 3}}, {A: {$lt: 6}}]}, we don't know whether 'name A' is a scalar or a set; in the worst case it's a set, so we can't intersect the index.

But when we say A:{$gt: 3, $lt: 6}, this is in fact saying A[0]:{$gt: 3, $lt: 6}, A[1]:{$gt: 3, $lt: 6}... Here A is the 'current element of A', not the 'name A', and the index could be intersected.

The MongoDB language currently doesn't capture the idea we have in natural language. So I believe we should let MongoDB understand this, or have a more explicit way to tell MongoDB whether we mean 'name A' or the 'current element of A'.

I agree with Derek: at least A:{$elemMatch:{$gt:3,$lt:6}} should work, and the index should be intersected in this case.

Comment by David Storch [ 12/Nov/15 ]

Hi Hoyt Ren, please re-open SERVER-21178 with the relevant details if you believe that your issue was not resolved by the changes in SERVER-16042.

Comment by Hoyt Ren [ 12/Nov/15 ]

After I upgraded my production server from 3.0.5 to 3.0.7, the problem described in SERVER-21178 still happens frequently. Do I have to rebuild indexes or do something else? Any suggestions? I hope you can raise the priority of this issue in a new version.

Comment by Michael Barszcz [ 21/Oct/15 ]

Yes, after reading through the ticket more I have a better understanding. At first I thought this issue was about range queries on the multikey field specifically, but it's more about range queries on a multikey index.

Comment by David Storch [ 21/Oct/15 ]

Hi mbarszcz, correct: the behavior you described is exactly the limitation we hope to fix under this ticket. Hopefully the ticket description and previous comments provide enough context on the issue: why this is currently the expected behavior, and what we plan to do in the future to fix it.

Comment by Michael Barszcz [ 20/Oct/15 ]

I'm assuming this behavior for a compound multikey index also falls under this issue?

Say I have documents in the formats below:

{ t: "2015-10-14T00:00:00", s: "SV103", c: [ { n: "bar", v: 1 } ] }
{ g: "TM_PUS_001_001", t: "2015-10-14T01:00:00", s: "SV103", c: [ {n: "foo", v: 427 }, {n: "baz", v: true}, {n: "boo", v: "ON"} ] }

With compound index:

{ "c.n": 1, t: 1, s: 1}

A query like this is still unable to use both the upper and lower bounds on the time field, which is the second piece of the index:

db.test.find( { "c.n": "bar", t: { $gte: "2015-10-10T00:00:00", $lt: "2015-10-15T00:00:00" }, s: "SV103" })

This was taken from an actual system, but you can see the lower bound on the time isn't used in the index scan (sorry, dates are stored as microseconds since the Unix epoch in the real system):

 "indexName" : "c.n_1_t_1_s_1",
 "isMultiKey" : true,
 "direction" : "forward",
 "indexBounds" : {
               "c.n" : [
                         "[\"CMD_ACCCTR\", \"CMD_ACCCTR\"]"
               ],
               "t" : [
                     "[-inf.0, 1444831200000000.0)"
               ],
               "s" : [
                      "[\"SV103\", \"SV103\"]"
               ]
 },
 "keysExamined" : 1147,

I'm having a hard time understanding why this is the case.

Comment by Derek Wilson [ 29/Jan/15 ]

I'm not sure why this shouldn't work:

db.test.find({data: {$elemMatch: {point: {$elemMatch: {$gte: 2, $lte: 2}}}}})

It seems like if, for indexing, it's going to treat data.point as an array when it has multiple values from different subdocs (in the case where it doesn't know it's not an array), then it should also treat it as if it were an array itself and let me do an $elemMatch on it?

The way of expressing this makes me think that the best solution is something like SERVER-6050 instead of SERVER-3173.

Comment by David Storch [ 28/Jan/15 ]

Hi Ben and Derek,

We will keep these suggestions in mind as we plan for future releases. We understand that this is an important limitation for a common use case, and have taken note of the attention that this ticket (and sibling tickets such as SERVER-3173) have received. Please continue to watch for progress updates.

Best,
Dave

Comment by Ben Rotz [ 28/Jan/15 ]

I agree with the concept of having the index take additional parameters that could enforce the way documents are selected from a collection. You could also do an index transform, for example, to create a case-insensitive index (when using such an index, because the keys are stored all lowercase, it works as a case-insensitive index). I hope someone at Mongo considers what Derek suggested. I would love to hear back from someone on this suggestion.

Comment by Derek Wilson [ 28/Jan/15 ]

Hi,

I have been working with a similar use case where I needed to select documents based on a bounded range of values within an array of subdocuments.

This really does seem like a common use case to me and I hope it is fixed quickly.

What I would really like to add to this issue is that the problem didn't pop up for me until much later than I would have expected.

My document structure looked like this during testing:

    {a: [{b: 1}]}
    {a: [{b: 2}]}
    {a: [{b: 3}]}

My index was set on a.b, and I used queries like {a: {$elemMatch: {b: {$gt:1, $lt: 3}}}}, which worked fine. The structure was designed to support one or more documents in "a", but mostly there would be only one, and on the occasions there are more than one, there are few.

So, the way I first noticed this was with a single shard slowing down as soon as data showed up with more than one subdoc. It wasn't immediately obvious that this was the issue.

It really does kill performance for these types of queries and it makes them very painful.

I do understand the issue of needing to deal with the case where b holds an array of values itself and is also inside an array of subdocs. But I disagree with the idea that you should require different query semantics to handle this case. I also don't see why you would require full schema validation as an alternative. Both of those options seem very heavyweight to me.

Something along the lines of SERVER-3173 seems interesting, but it seems to me we would just need something to indicate whether a field in a multikey index is multikey because of the data it contains or because of the structure containing it - which seems like it could be as simple as a flag on any nested index?

And if something like that could work, it also seems like I could/should be able to indicate this at the index level at index creation time... and I shouldn't need schema validation, because really what I'm indicating is how I expect my indexed queries to be planned, rather than what data to enforce. As you point out, you need to plan for the worst-case scenario by default - but if I indicate that I want to exclude handling certain worst cases, then I'm also indicating that I'm willing to accept what seems like, at worst, a subset of correct results in the case that I mess up my data.

Really, if it's just a planning override at the index level, I could also see cases where you have something like

{a: [{b:[1,2]}]}

as a valid option alongside my use case, where you still only want "b"s where any single value matches all the conditions and you want to use the index.

Comment by Ben Rotz [ 15/Sep/14 ]

Hi Dave,

Your summary captures the whole thing well. I had hoped that this issue would have been moved to a planned version by now. It seems that this would be a common issue, but perhaps my schema is more uncommon than I expected, and I am the only one making noise about this issue.

In addition to the previously suggested fixes, another possibility could be a new operator, "$scalarMatch", which works like "$elemMatch" but exclusively with scalars. Personally, I can't imagine a use case in which someone would intentionally create a schema where the same field name stores both scalars and arrays, and, even if they did, would then also want to perform a find/match against it and exclude scalar matches. So I personally prefer having $elemMatch match scalars as well as arrays (simpler usage, simpler documentation, etc.) instead of a new operator (which then requires additional documentation, and people saying "well wtf do I need this for", etc.). A sketch of the idea follows.
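
For the sake of illustration only, the proposed operator might look like this (hypothetical syntax; no such operator exists in MongoDB):

// Hypothetical: $scalarMatch would apply all conditions to the same scalar
// value, letting the planner intersect the bounds on the index.
db.test.find({'e.t': {$scalarMatch: {$gte: ISODate("2014-08-20T12:00:00Z"), $lt: ISODate("2014-08-20T13:00:00Z")}}});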

Comment by David Storch [ 02/Sep/14 ]

Hi nefiga,

Thanks for all your thoughts on this! I'll try to respond to all your points above. Please let me know if I missed anything.

If there is anything I can do to get this on someone's radar to address sooner rather than later, please let's do it!

Yes, I'll definitely make sure that this is on the radar. I can't make any promises about a specific timeline or fix version, but it's always good to hear from users about important use cases in order to help us improve the product.

Also, I realized the problem comes about whenever an array at ANY level gets used in the index.

Yes, currently the query engine only knows whether or not an index is multikey. Since we don't have schema validation, the system always assumes the worst case (i.e. assume anything might be an array). This is necessary in the current system for correctness of query results. I can imagine fixing this and related problems in a more general way if we introduce schema validation down the line.

Instead of having

isMultiKey : true

(which I believe is what is telling the query engine to use a "true" for the lower bound of the e.t index for a range query), you could possibly have

isMultiKey : {'e' : true, 'e.t' : false}

As Asya mentioned above, something like this has already been proposed in SERVER-3173.

So, in short, it seems there are two possible fixes here:

There are definitely a few possibilities for fixing this. Please continue to watch this ticket in order to track further developments.

Best,
Dave

Comment by Asya Kamsky [ 01/Sep/14 ]

I believe the last proposal is related to SERVER-3173 which proposes tracking which indexed field is an array.

Comment by Ben Rotz [ 30/Aug/14 ]

Hi Dave,
Your last suggestion of

Users who maintain the invariant that "a.b" is not an array should have a way of specifying a more efficient query. I am converting this ticket into a request for such a mechanism.

sums up the issue at hand nicely. Using your idea above, let's say we had:

db.test.drop();
db.test.insert({a: [{b: [2, 7]}, {b: [8]}, {b: 8}]});
db.test.insert({a: [{b: [2, 5]}, {b: [8]}, {b: 8}]});
db.test.insert({a: [{b: [2, 7]}, {b: [5]}, {b: 8}]});
db.test.insert({a: [{b: [2, 7]}, {b: [8]}, {b: 5}]});
db.test.ensureIndex({'a.b' : 1});

If we run the query

db.test.find({a: {$elemMatch: {b: {$gt: 3, $lt: 6}}}})

We get all records back, which is correct (although some users may have expected record 1 to be excluded; as we both know, it is technically correct that it is included), but our query does NOT use the index upper bound, as that would incorrectly filter out the 1st record.

However, if we run the query

db.test.find({a: {$elemMatch: {b: {$elemMatch: {$gt: 3, $lt: 6}}}}})

We get only records 2 and 3 back (excluding records 1 and 4). But the index is used as desired (having an upper and lower bound). I agree that record 1 should be excluded for this query, but I would argue that record 4 should have been included, as that also opens the door to using $elemMatch in situations such as mine. Seeing it described this way makes me agree with 6050 for inclusion. I think what we've done here is basically provide a more detailed analysis of 6050, and hopefully I've shown why it is not only a good idea but necessary to get indexing to work for the common scenario of indexing an array of objects with a timestamp (or another value that would be used in a "range", but timestamp seems very common). I would put this forth as a possible solution; I find it to be an urgent fix, and I send the strongest desires possible over the internet tubes to you to get this slated for a fix in 2.7/2.8.

Also, I realized the problem comes about whenever an array at ANY level gets used in the index. At first I was confused about the difference between this and just doing a regular range query on a simple index on a collection without any arrays or subdocuments. From what I can tell, when an index is built in Mongo, if any of the indexed fields is an array, the value for isMultiKey is set to true. This means for an index like

{'e.t' : 1}

, if either "e" or "t" has arrays of size > 1, the isMultiKey gets set to true. That can been seen in the following example:

db.test.drop();
db.test.insert({'t' : [ISODate('2014-08-01 00:00:00')]});
db.test.ensureIndex({'t' : 1});
db.test.find({'t' : {$gte : ISODate('2014-08-01 00:00:00'), $lt : ISODate('2014-08-01 03:00:00')}}).explain();
//note that the range works as desired with upper and lower bounds on the index
db.test.update({'t' : [ISODate('2014-08-01 00:00:00')]}, {'$addToSet' : {'t' : ISODate('2014-08-01 01:00:00')}});
db.test.find({'t' : {$gte : ISODate('2014-08-01 00:00:00'), $lt : ISODate('2014-08-01 03:00:00')}}).explain();
//note that the index no longer has a lower bound now that the "t" array has a case where the array length > 1... again, we expect this; I'm just documenting why this is the case since it was confusing to me initially
db.test.find({'t' : {$elemMatch : {$gte : ISODate('2014-08-01 00:00:00'), $lt : ISODate('2014-08-01 03:00:00')}}}).explain();
//this gives us the proper upper and lower bounds on the index, and gives us correct results as well because "t" is an array, and not a scalar

The only reason I illustrate the above example is to suggest a POSSIBLE second solution to the problem: change "isMultiKey" to be a structured value indicating which levels of the index are multikey, instead of just a simple boolean. That is to say, for the data

db.test.drop();
db.test.insert({'e' : [{'t' : ISODate('2014-08-01 00:00:00')}, {'t' : ISODate('2014-08-01 01:00:00')}]});
db.test.insert({'e' : [{'t' : ISODate('2014-08-01 02:00:00')}, {'t' : ISODate('2014-08-01 03:00:00')}]});
db.test.ensureIndex({'e.t' : 1});
db.test.find({'e' : {$elemMatch : {'t' : {$gte : ISODate('2014-08-01 00:00:00'), $lt : ISODate('2014-08-01 02:00:00')}}}}).explain();

Instead of having

isMultiKey : true

(which I believe is what tells the query engine to use "true" for the lower bound of the e.t index in a range query), you could possibly have

isMultiKey : {'e' : true, 'e.t' : false}

which I can imagine would address this issue, since we could see that the second field of the index ("t") does not utilize an array of size > 1 (i.e. is a scalar). There could perhaps be other ways to store this information, but I believe the example isMultiKey syntax above would store all the information necessary to know that the "e" part of the index is multikey, but the "t" portion of "e" (as in e.t) is not.

So, in short, it seems there are two possible fixes here:

  1. Have $elemMatch work against non-array values.
  2. Make isMultiKey a more complex type, such that it can be utilized to interpret the type of information at each document level of the index for better interpretation of complex indexes (like our "e.t" example).

#1 seems like the easiest fix. But is it the best? Part of me thinks the better answer is #2, but I suppose if someone ever had a situation in which the "t" in the "e.t" example was sometimes an array and sometimes a scalar (I personally would never model things this way, but I digress), then they would want #1, since #2 alone would still pose a potential index problem when doing a range. Good thing there are geniuses at 10gen who can help find the best answer. Maybe the best answer is actually to do both.

If there is anything I can do to get this on someone's radar to address sooner rather than later, please let's do it! I'm not a billionaire (sorry!), but I'd gladly order the whole office pizza and drinks, just email me the office address of where to send it! I would really love if this could make it into the 2.7/2.8 branch.

Thank you!!!

Comment by David Storch [ 29/Aug/14 ]

Hi nefiga,

Thanks for the detailed bug report. As I think you already discovered in your investigations, this is expected behavior. You are right that you shouldn't have to bend over backwards in order to get efficient range queries over nested documents inside an array. However, our query language doesn't yet have a good mechanism for this. Therefore, I'm going to keep this ticket open as a feature request for our query language.

Here's a quick explanation of why both predicates are not used to constrain the index scan in the example above. Suppose you have a collection with index {"a.b": 1}. Every document in the collection has an array "a" containing nested documents with a field "b":

> t.find()
{ "_id" : ObjectId("5400d72583d8a7ffd7b100b9"), "a" : [ { "b" : 1 }, { "b" : 2 }, { "b" : 3 } ] }
{ "_id" : ObjectId("5400d72b83d8a7ffd7b100ba"), "a" : [ { "b" : 4 }, { "b" : 5 }, { "b" : 6 } ] }

When you query on a particular range of "b" values, any indexed plan created by the query planner must not use both predicates to constrain the index scan:

> t.find({"a.b": {$gt: 3, $lt: 6}}).explain()
{
	"cursor" : "BtreeCursor a.b_1",
	"isMultiKey" : true,
	"n" : 1,
	"nscannedObjects" : 2,
         ...
	"indexBounds" : {
		"a.b" : [
			[
				-Infinity,
				6
			]
		]
	},
	"server" : "Macintosh.local:27017",
	"filterSet" : false
}

This is true even if you use an $elemMatch on "a":

> t.find({a: {$elemMatch: {b: {$gt: 3, $lt: 6}}}}).explain()
{
	"cursor" : "BtreeCursor a.b_1",
	"isMultiKey" : true,
	"n" : 1,
	"nscannedObjects" : 2,
	...
	"indexBounds" : {
		"a.b" : [
			[
				-Infinity,
				6
			]
		]
	},
	"server" : "Macintosh.local:27017",
	"filterSet" : false
}

Suppose for argument's sake that the query planner generated a plan which scanned the doubly-open interval (3, 6) for the index {"a.b": 1}. Why is this wrong? Well suppose that we insert a document for which "a.b" is the array [2, 7]:

t.insert({a: [{b: [2, 7]}, {b: 8}]});

This document matches the query: "a" is indeed an array which has a single nested document matching both the conditions that b>3 and b<6. However, if we were to scan the index only for keys on the interval (3, 6), we would miss this document.

This, of course, is only a problem if "a.b" is an array. Users who maintain the invariant that "a.b" is not an array should have a way of specifying a more efficient query. I am converting this ticket into a request for such a mechanism.

Please feel free to reach out with any further questions.

Best,
Dave

Comment by Ben Rotz [ 29/Aug/14 ]

After further discovery, it looks like https://jira.mongodb.org/browse/SERVER-6050 is indeed the answer to this problem, even though it just sounds wrong initially.

I find this fairly ridiculous, and it is not at all clear or obvious by any stretch. The only workaround I can come up with is to take the example mentioned in the ticket and make e.t an array holding a single time value, instead of just a scalar time value. That is,

e.t : [ISODate()]

instead of

e.t : ISODate()

Then the query

db.test.find({'e.t' : {'$elemMatch' : {'$gte' : ISODate("2014-08-20T12:00:00.000Z"), '$lt' : ISODate("2014-08-20T13:00:00.000Z")}}}).count();

seems to be both accurate and to use the index. However, since I'm actually applying several conditions in the $elemMatch, I have to take it one step further and come up with a more complex query. Given the test data:

var randDayMin = 10;
var randDayMax = 20;
var randHrMin = 0;
var randHrMax = 23;
var randValueMin = 0;
var randValueMax = 5;
for (var i = 1; i <= 10000; i++) {
    var randDay1 = (Math.floor(Math.random() * (randDayMax - randDayMin + 1)) + randDayMin);
    var randHour1 = (Math.floor(Math.random() * (randHrMax - randHrMin + 1)) + randHrMin);
    var randDay2 = (Math.floor(Math.random() * (randDayMax - randDayMin + 1)) + randDayMin);
    var randHour2 = (Math.floor(Math.random() * (randHrMax - randHrMin + 1)) + randHrMin);
    var randValue1 = (Math.floor(Math.random() * (randValueMax - randValueMin + 1)) + randValueMin);
    var randValue2 = (Math.floor(Math.random() * (randValueMax - randValueMin + 1)) + randValueMin);
    db.test.insert( {x : i, e : [{'t' : [new Date(2014, 07, randDay1, randHour1, 30)], 'v' : randValue1}, {'t' : [new Date(2014, 07, randDay2, randHour2, 30)], 'v' : randValue2}]} );
}

With index

db.test.ensureIndex({'e.t' : 1});

The query to accurately get records with events between 2014-08-20 12:00 and 2014-08-20 13:00 that also have a value of 1, 2, 3, or 4 (excluding 0 and 5) becomes:

db.test.find(
{
  'e' : {
    '$elemMatch' : {
      't' : {
        '$elemMatch' : {
          '$gte' : ISODate("2014-08-20T12:00:00.000Z"),
          '$lt' : ISODate("2014-08-20T13:00:00.000Z")
        }
      },
      'v' : {
        '$in' : [1,2,3,4]
      }
    }
  }
}
).count();

I follow the logic behind all of this, and would agree with it, except for the part that requires me to put "t" : ISODate() into an array just to get the index to work properly because I happen to be using a range. That part is lame and needs to be fixed ASAP.

I very much look forward to any feedback or suggestions or anything on this issue so I know if this is a priority/concern to anyone else.

From what I gather, a fix along the lines of 6050 would have my existing schema working properly, with the query modification that the inner $elemMatch is apparently required to get the range to properly use the index. This is a little confusing to me, but hopefully this conversation doesn't die, as I'd like to get a solution in place here.

Thanks!!

Comment by Ben Rotz [ 28/Aug/14 ]

I'm just starting to realize how bad this actually is.

If I understand correctly, there is actually NO WAY to effectively use an index to create a bounded range query around a scalar value that resides in an array of objects.

In other words, if I have a structure like

{
    "_id" : ObjectId("53ff9b1065e85b8f7447693b"),
    "x" : 18,
    "e" : [
        {
            "t" : ISODate("2014-08-10T15:30:00Z")
        },
        {
            "t" : ISODate("2014-08-13T13:30:00Z")
        }
    ]
}
{
    "_id" : ObjectId("53ff9b1065e85b8f7447693c"),
    "x" : 19,
    "e" : [
        {
            "t" : ISODate("2014-08-12T08:30:00Z")
        },
        {
            "t" : ISODate("2014-08-13T17:30:00Z")
        }
    ]
}

*1,000,000 more records

There is actually NO WAY to say "get me the records with e.t between 2014-08-13T17:30:00Z and 2014-08-13T18:30:00Z" without doing basically a mostly-full table scan. Is that correct? I mean, at best I pick the datetime closest to either "edge" of my data, and hopefully I'm near an edge (the beginning of time for my data, or the end), but if I pick an arbitrary date right in the middle of my data, I'm hosed? No way to have an index that would work in that situation?

Comment by Ben Rotz [ 28/Aug/14 ]

I see this is somewhat of a rehash of this issue: https://jira.mongodb.org/browse/SERVER-8610, which points to https://jira.mongodb.org/browse/SERVER-6050

I think the recommendation in https://jira.mongodb.org/browse/SERVER-6050 is halfway broken: $elemMatch should NOT match flat data; that makes no sense, and I cannot see the reasoning behind it. Basically I agree with the comment by Eduardo Manso in issue 6050. What I can understand is applying upper and lower range bounds to a scalar, which it seems like the OP is trying to do, and which is what I am trying to do as well.

This is a major headache for me, as it means the main queries in my system all run much slower than they would if the range just worked properly as expected. I urgently ask for a prioritization of this issue. If there is any assistance or debugging I can do to push this along, please let me know. Also let me know if there are any acceptable workarounds which do not require manipulation of source data. I understand you can use "hint" to force the index, but currently there is no way to force hints in the aggregation framework, so it isn't a full workaround.

Also related to https://jira.mongodb.org/browse/SERVER-7959
