[SERVER-47925] Add a new $elemMatch-like operator which matches both arrays and scalars Created: 04/May/20 Updated: 07/Apr/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 4.0.16 |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Glen Miner | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | qexec-team | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Query Execution
|
||||||||
| Participants: | |||||||||
| Description |
|
Imagine a collection where the field "a" can store either an integer or an array of integers:
Now, suppose that a user wishes to find documents where either "a" is a number in the interval [0, 10], or "a" is an array containing at least one number in the interval [0, 10]. This query should match documents _id:1 and _id:2, but not _id:3. A naive way to write this query would be like so:
As you can see, the semantics of this query are to match all three documents, which was not our intention. This is because there is no requirement that an individual array element satisfy both the $gte and the $lte predicate simultaneously. In order to express this simultaneity for array elements, the match language offers $elemMatch:
Unfortunately, as you can see above, this query is not quite right either. _id:3 no longer matches, which is good, but _id:1 also no longer matches. Recall that the query which we want to express should match _id:1 and _id:2. However, the $elemMatch predicate filters out documents where "a" is not an array. This scenario motivates a feature request. We should have an operator in the match language which is akin to $elemMatch, but also matches scalars:
This hypothetical new operator would also allow the query planner to use tight index bounds over multikey indexes, which should avoid any slow plans with loose index bounds such as those discussed in the original issue report below. Original descriptionWe have a collection of 28 million records that have a IP field that is sometimes a NumberInt (for IPv4), sometimes BinData (for IPv6), and sometimes an array of two elements of either type (we sometimes need to track two addresses). Historically this was a regular index – IP was just a scalar – and way back in those days we had scripts that would run range queries to find people associated with a CIDR block or an ASN routing block. Before we went multi-key these range queries were super fast however today I found that now a range query is so slow I'm not sure it's actually doing anything: db.accounts.count({IP:{$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}}) I waited for over 5 minutes before aborting it. If I convert this into a dumb loop of 1023 scalar queries it's super fast: var count = 0; for(var i = -656117760; i <= -656116737; ++i) { count += db.accounts.count( {IP:NumberInt(i)}); } print(count); This took about 2 seconds, if that. I double checked with query-explain and it looks like it's still doing an IXSCAN so I can't understand why the smart query should take so long (see attached explain.txt). The only difference seems to be that when it's scanning a multi-key index the index bounds are infinite on one side. I did a quick skim of the 4.2 release notes and didn't see any mention of this being improved - sorry if I missed it. |
| Comments |
| Comment by David Storch [ 14/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm re-opening this ticket and converting it into a feature request rather than a bug report. I think the scenario of field which can be either a scalar or an array would be reason for us to add a new operator to the $match language whose semantics are $elemMatch-like, but which matches scalars instead of arrays. A similar idea was proposed long ago under | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 11/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi shaggie76,
In order to correctly answer that query the index scan will still be large: [-656117760, infinity+]. When you create a multikey index on {IP: [1, 2, 3]} there are three index keys: 1, 2, and 3. This means that {$not: {$type: 4} is only applied after the index scan since Mongodb can't know from the index which document is an array or not. That said, the SERVER project is for bugs and feature suggestions for the MongoDB server. As this ticket does not appear to be a bug, I will now close it. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag. Kind regards, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Glen Miner [ 08/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Fascinating! I suppose that does make sense and I can understand that changing this behavior is impossible. Something still isn't quite right though, because this works (but is excruciatingly slow):
I just ran a 'time mongo --eval...' for this on a production server with dual-socket Xeon E5-2620 (24 threads) and 128GB of RAM backed by Optane SSDs and it ran for 26 minutes before I lost interest in waiting. In contrast the for-loop took 409ms and returned the right answer.
It's not the $or either: the 'not an array' clause is also frighteningly slow:
So is it possible that it's easier to fix (and justify fixing!) the "not an array" part of the index walk? Or is there some other way to express this query so that it plays nice?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 08/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We were proposing an $elemMatch-like operator that also matches scalars because we don’t have the tools yet to express the query you want: matching documents where x is a scalar in a given range, or where x is an array that has a scalar in the range. $elemMatch does the latter part, but doesn't match scalars.
{a: {$gte: 0, $lte: 10}} is saying to match documents where a has at least one element greater than or equal to 0 (0, 1, …, 11 will match) AND at least one element lesser than or equal to 0 (0, -1, -2 will match). This means that {a: [-1, 11]} matches. The scan range is [0, to infinity+] and [infinity-, 10], not just [0, 10]. I hope this helps clarify! | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Glen Miner [ 07/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think that would be inconsistent with out exact match queries work and doesn't address the fact that range queries match results they shouldn't. If it were up to me I'd fix range queries so they work like exact-match queries. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 07/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
shaggie76, thank you for all the info! We could convert this ticket into a new feature request ticket: an $elemMatch-like operator that also matches scalars. What do you think?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Glen Miner [ 06/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I realized something was up when I noticed that the brute force way didn't match the range query:
So it occurred to me that exact match was missing array elements or range match might be returning results that don't belong (spoiler: it's returning results that don't belong).
I checked elemMatch and it doesn't work for scalars (as expected)
It does work for array elements of course:
However the multi-key index also has array elements and the same query has matches that don't belong:
What's surprising to me is that an exact match seems to work (and is what we generally do):
So exact match hits scalars and array elements alike (righteous!) but range queries are hitting array elements they shouldn't (bogus!)
Which produces what I believe to be the correct count but my rough timing shows it takes about 10 seconds (again, 10x slower than the for-loop).
In summary:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 06/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi shaggie76, We think that using $elemMatch in your query should solve the performance issue:
Let us know if it worked! | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Glen Miner [ 05/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To avoid straining our production servers I ran a very slow mongoexport to extract just _id and IP and host it in a developer workstation to gather this information for you. Having 99% of the document data removed does help considerably – roughly speaking the for-loop brute force query takes about 1 second and the single count query takes 22 seconds. I will attach the data and log as requested – please let me know if I have missed something because now that I have a test bench I can run it again easily.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 04/May/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi shaggie76, Thanks for the report. Kind regards, |