[SERVER-964] Get $elemMatch to correctly use the index on the array Created: 05/Apr/10 Updated: 07/Mar/14 Resolved: 06/Apr/10 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 1.3.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Loïc d'Anterroches | Assignee: | Eliot Horowitz (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
$ uname -a |
||
| Attachments: |
|
| Participants: |
| Description |
|
Hello, I am using MongoDB to store, at the moment, 100,000 documents. Each document can have from 1 to 150 properties, and I need to search by property range for any combination of properties. On average a document has 5 properties set. Here are, for example, two documents: the first has properties 1, 2 and 17 set, and the second has properties 1, 3 and 45:

doc1 = {prop1: 123.1, prop2: 345.2, prop17: 12.0}
doc2 = {prop1: 124.1, prop3: 33455.2, prop45: 11232.0}

The kind of queries needed are:

db.collection.find({prop1: {$gt: 123.0, $lt: 250.0}, prop3: {$gt: 10.0}});

It is not possible to create 150 indexes, one for each property, so Mathias (mstearn) suggested storing the properties this way, with a single index on p:

doc1 = {p: [{prop1: 123.1}, {prop2: 345.2}, {prop17: 12.0}]}
doc2 = {p: [{prop1: 124.1}, {prop3: 33455.2}, {prop45: 11232.0}]}

And querying this way:

db.testarray.find({p: {$elemMatch: {prop1: {$gt: 123.0, $lt: 250.0}, prop3: {$gt: 10.0}}}});

But in that case, the index on p is not used and the complete collection is scanned. See the attached javascript setting up a small 10,000 document collection with some test data. |
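For reference, a minimal sketch along the lines of the attached script (data values hypothetical); per the report, explain() here shows "cursor" : "BasicCursor", i.e. a full collection scan, rather than a BtreeCursor on the p index:

for (var i = 0; i < 10000; i++) {
    // each document carries a couple of sparse properties
    db.testarray.insert({p: [{prop1: Math.random() * 1000}, {prop3: Math.random() * 100}]});
}
db.testarray.ensureIndex({p: 1});
// the query from the description scans all 10,000 documents
db.testarray.find({p: {$elemMatch: {prop1: {$gt: 123.0, $lt: 250.0}, prop3: {$gt: 10.0}}}}).explain();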
| Comments |
| Comment by Loïc d'Anterroches [ 09/Apr/10 ] |
|
Aaron, yeah! I was puzzled: I now have 10 different ways to store the data and index it, and I was wondering why some queries were faster than others with the $all keyword. Now I can use my knowledge of the data to write my queries in an order that matches the least common properties first. Really, really good tip. As for the compound index, I totally forgot about it, thanks for the reminder; now back to the tests. This discussion makes one thing clear: know and analyze your dataset if you want to get the most out of MongoDB. |
| Comment by Aaron Staple [ 08/Apr/10 ] |
|
Hi Loic, if you want to look up quickly on both key and val, you can create a compound index like this: {'p.key':1,'p.val':1}. (Creating separate indexes on key and val won't help, as you've mentioned.) Also, I should point out that when you use $all, index matching is currently performed only on the first element of the $all array spec. All documents matching that first $all element are then checked to see whether they match the rest. |
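A sketch of how this plays out with the compound index (collection and property names hypothetical); since only the first $all element drives index matching, the most selective $elemMatch spec should come first:

db.c.ensureIndex({"p.key": 1, "p.val": 1});
db.c.find({p: {$all: [
    // rarest property first, so the index narrows the candidate set the most
    {$elemMatch: {key: "prop45", val: {$gt: 10.0}}},
    {$elemMatch: {key: "prop1", val: {$gt: 123.0, $lt: 250.0}}}
]}});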
| Comment by Loïc d'Anterroches [ 07/Apr/10 ] |
|
Aaron, thanks a lot. I keep commenting here as it may be of interest to other people. In Aaron's particular example, even if you add a 'p.val' index, it is not used. This means that the search will scan all the documents having the given property. This is good and bad. The good point: when you have a very sparse "density of properties", you can cut the search down tremendously, and this is my case, except for some of the properties. The bad point is that if you have one property which is always set for your 26 million documents, this is not going to work; for example the weight of a molecule, as it is always known. The solution: put all the properties with a very low count in the storage proposed by Aaron, use dedicated indexes for the high-density properties, and be smart about rebalancing the indexes over time depending on the usage patterns and the data in the db. This is really great because it means no special hack! Thanks a lot! |
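A sketch of that hybrid layout (collection and field names hypothetical, e.g. a molecular weight mw that every document has):

// the dense, always-present property gets its own field and dedicated index
db.molecules.ensureIndex({mw: 1});
// sparse properties live in the key/val array behind the compound index
db.molecules.ensureIndex({"p.key": 1, "p.val": 1});
db.molecules.insert({mw: 340.2, p: [{key: "prop17", val: 12.0}]});
// the range on the dense property is served by its dedicated index,
// combined with an $elemMatch condition on a sparse property
db.molecules.find({mw: {$gt: 300, $lt: 400}, p: {$elemMatch: {key: "prop17", val: {$gt: 10.0}}}});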
| Comment by Aaron Staple [ 07/Apr/10 ] |
|
So, just in case you try the key/val method I described: you can put several $elemMatch specs inside an $all in order to query multiple properties simultaneously. |
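With the key/val model from the 06/Apr comment below, such a query could look like this sketch (property names taken from the description):

db.c.find({p: {$all: [
    {$elemMatch: {key: "prop1", val: {$gt: 123.0, $lt: 250.0}}},
    {$elemMatch: {key: "prop3", val: {$gt: 10.0}}}
]}});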
| Comment by Loïc d'Anterroches [ 06/Apr/10 ] |
|
Thanks a lot for the time you took to answer. I will check this memcmp format; it may be the most efficient way in the end, as I will not have to store the name of the property as a key in the array, and the indexing will be efficient on the "missing properties". Really, thank you; I will keep you informed. |
| Comment by Eliot Horowitz (Inactive) [ 06/Apr/10 ] |
|
one option is munging key and value together, but then you could only do alpha compares. another option is to use a binary type, or just a binary array: use the first 4 bytes as the key, and then put your data in a memcmp format. |
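A sketch of the string-munging option (the encodeProp helper is hypothetical); fixed-width zero-padding keeps alpha order consistent with numeric order, but only for non-negative values that fit the padded width:

function encodeProp(name, val) {
    // zero-pad so that lexicographic comparison matches numeric order
    var s = val.toFixed(4);
    while (s.length < 12) s = "0" + s;
    return name + ":" + s;
}
db.c.insert({p: [encodeProp("prop1", 123.1), encodeProp("prop17", 12.0)]});
db.c.ensureIndex({p: 1});
// the numeric range on prop1 becomes a plain string range over the munged keys
db.c.find({p: {$gt: encodeProp("prop1", 123.0), $lt: encodeProp("prop1", 250.0)}});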
| Comment by Aaron Staple [ 06/Apr/10 ] |
|
An alternative data model you could use is this: {p: [ {key:"prop1",val:123.4}, {key:"prop2",val:555} ]}. Then make an index on {'p.key':1} and query like {p: {$elemMatch: {key:"prop1", val:{$gt:1}}}}. The downside is that you can only query for one property type at a time. |
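Sketched end to end (collection name hypothetical); explain() should report a BtreeCursor on p.key, confirming the scan is narrowed to elements whose key matches:

db.c.insert({p: [{key: "prop1", val: 123.4}, {key: "prop2", val: 555}]});
db.c.ensureIndex({"p.key": 1});
db.c.find({p: {$elemMatch: {key: "prop1", val: {$gt: 1}}}}).explain();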
| Comment by Loïc d'Anterroches [ 06/Apr/10 ] |
|
Ok, so do I understand correctly that one cannot have 100 independent properties for a document and then search in two of them with greater/lower bounds? I tried all the combinations of storage:

1. {prop1: 123.4, prop2: 234.5, ...}
2. {p: [{prop1: 123.4}, {prop2: 234.5}, ...]}

without success in getting the index to be used. In fact, it only works in case 1 when I index a subset of the 100 properties. I will have a close look at what my users are looking at and try to index the most frequently used queries, and if I really need more, I will recompile MongoDB to bump up the index limit. I know it will be bad for the insert time on the given collection, but in that particular case it is not a problem. |
| Comment by Eliot Horowitz (Inactive) [ 06/Apr/10 ] |
|
ok - closing, since this doesn't really make sense |
| Comment by Loïc d'Anterroches [ 06/Apr/10 ] |
|
In my particular case, the way the data is stored is not a problem; I can also store it the way Aaron proposed. As long as my min/max query runs against the index, I am happy. Also note that in my case I insert/update the data once a month and run thousands of queries a day, so time to insert is not an issue. |
| Comment by Aaron Staple [ 06/Apr/10 ] |
|
I don't think this works in general. An index on 'p' doesn't necessarily tell us where to find 'x' values of interest for a query like db.c.find( {p: {$elemMatch: {x:{$gt:1}}}} ), since we may have documents whose array elements contain other fields besides 'x', or no 'x' at all, and the index entries are the whole elements. |
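A sketch illustrating this (documents hypothetical): the index keys on {p: 1} are the entire array elements, compared field by field, so the 'x' values of interest are scattered across the index:

db.c.ensureIndex({p: 1});
db.c.insert({p: [{x: 2}]});        // index key: {x: 2}
db.c.insert({p: [{w: 9, x: 5}]});  // index key: {w: 9, x: 5}, ordered under 'w' first
db.c.insert({p: [{y: 0}]});        // index key: {y: 0}, no 'x' at all
// {$elemMatch: {x: {$gt: 1}}} cannot be mapped to one contiguous index range,
// so the query falls back to scanning the collection
db.c.find({p: {$elemMatch: {x: {$gt: 1}}}});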