[SERVER-47925] Add a new $elemMatch-like operator which matches both arrays and scalars Created: 04/May/20  Updated: 07/Apr/23

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 4.0.16
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Glen Miner Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 1
Labels: qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive diagnostic.data.zip     Text File explain.txt     Zip Archive mongod.zip    
Issue Links:
Related
is related to SERVER-6050 Consider allowing $elemMatch applied ... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

Imagine a collection where the field "a" can store either an integer or an array of integers:

> db.c.find();
{_id: 1, a: 3}
{_id: 2, a: [4, 11]}
{_id: 3, a: [-1, 10]}

Now, suppose that a user wishes to find documents where either "a" is a number in the interval [0, 10], or "a" is an array containing at least one number in the interval [0, 10]. This query should match documents _id:1 and _id:2, but not _id:3. A naive way to write this query would be like so:

> db.c.find({a: {$gte: 0, $lte: 10}});
{_id: 1, a: 3}
{_id: 2, a: [4, 11]}
{_id: 3, a: [-1, 10]}

As you can see, the semantics of this query are to match all three documents, which was not our intention. This is because there is no requirement that an individual array element satisfy both the $gte and the $lte predicate simultaneously. In order to express this simultaneity for array elements, the match language offers $elemMatch:

> db.c.find({a: {$elemMatch: {$gte: 0, $lte: 10}}});
{_id: 2, a: [4, 11]}

Unfortunately, as you can see above, this query is not quite right either. _id:3 no longer matches, which is good, but _id:1 also no longer matches. Recall that the query which we want to express should match _id:1 and _id:2. However, the $elemMatch predicate filters out documents where "a" is not an array. This scenario motivates a feature request. We should have an operator in the match language which is akin to $elemMatch, but also matches scalars:

> db.c.find({a: {$inventedNewKindOfElemMatch: {$gte: 0, $lte: 10}}});
{_id: 1, a: 3}
{_id: 2, a: [4, 11]}

This hypothetical new operator would also allow the query planner to use tight index bounds over multikey indexes, which should avoid any slow plans with loose index bounds such as those discussed in the original issue report below.

Original description

We have a collection of 28 million records that have a IP field that is sometimes a NumberInt (for IPv4), sometimes BinData (for IPv6), and sometimes an array of two elements of either type (we sometimes need to track two addresses).

Historically this was a regular index – IP was just a scalar – and way back in those days we had scripts that would run range queries to find people associated with a CIDR block or an ASN routing block. Before we went multi-key these range queries were super fast however today I found that now a range query is so slow I'm not sure it's actually doing anything:

db.accounts.count({IP:{$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}})

I waited for over 5 minutes before aborting it.

If I convert this into a dumb loop of 1023 scalar queries it's super fast:

var count = 0; for(var i = -656117760; i <= -656116737; ++i) { count += db.accounts.count(

{IP:NumberInt(i)}

); } print(count);

This took about 2 seconds, if that.

I double checked with query-explain and it looks like it's still doing an IXSCAN so I can't understand why the smart query should take so long (see attached explain.txt). The only difference seems to be that when it's scanning a multi-key index the index bounds are infinite on one side.

I did a quick skim of the 4.2 release notes and didn't see any mention of this being improved - sorry if I missed it.



 Comments   
Comment by David Storch [ 14/May/20 ]

I'm re-opening this ticket and converting it into a feature request rather than a bug report. I think the scenario of field which can be either a scalar or an array would be reason for us to add a new operator to the $match language whose semantics are $elemMatch-like, but which matches scalars instead of arrays. A similar idea was proposed long ago under SERVER-6050 and duplicate SERVER-5320. This idea was rejected, however, since it would be a major breaking change. There is nothing preventing us from adding a new concept to the language in a non-breaking fashion, however, whose semantics are like an $elemMatch that also matches scalars. I searched the backlog but couldn't find a pre-existing request for this.

Comment by Carl Champain (Inactive) [ 11/May/20 ]

Hi shaggie76,

db.accounts.count({IP: {$not:{$type:4},$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}})

In order to correctly answer that query the index scan will still be large: [-656117760, infinity+]. When you create a multikey index on {IP: [1, 2, 3]} there are three index keys: 1, 2, and 3. This means that {$not: {$type: 4}  is only applied after the index scan since Mongodb can't know from the index which document is an array or not.

That said, the SERVER project is for bugs and feature suggestions for the MongoDB server. As this ticket does not appear to be a bug, I will now close it. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag.

Kind regards,
Carl

Comment by Glen Miner [ 08/May/20 ]

Fascinating! I suppose that does make sense and I can understand that changing this behavior is impossible.

Something still isn't quite right though, because this works (but is excruciatingly slow):

 

> db.accounts.count({
    $or : [
        {IP: {$not:{$type:4},$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}},
        {IP: {$elemMatch: {$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}}}
    ]
})

I just ran a 'time mongo --eval...' for this on a production server with dual-socket Xeon E5-2620 (24 threads) and 128GB of RAM backed by Optane SSDs and it ran for 26 minutes before I lost interest in waiting. In contrast the for-loop took 409ms and returned the right answer.

 

It's not the $or either: the 'not an array' clause is also frighteningly slow:

 

> db.accounts.count({IP: {$not:{$type:4},$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}})

So is it possible that it's easier to fix (and justify fixing!) the "not an array" part of the index walk? Or is there some other way to express this query so that it plays nice?

 

 

 

Comment by Carl Champain (Inactive) [ 08/May/20 ]

shaggie76,

We were proposing an $elemMatch-like operator that also matches scalars because we don’t have the tools yet to express the query you want: matching documents where x is a scalar in a given range, or where x is an array that has a scalar in the range. $elemMatch does the latter part, but doesn't match scalars. 

 

Bounded range queries ($gte,$lte) return results for arrays that don't contain a matching element

{a: {$gte: 0, $lte: 10}} is saying to match documents where a has at least one element greater than or equal to 0 (0, 1, …, 11 will match) AND at least one element lesser than or equal to 0 (0, -1, -2 will match). This means that {a: [-1, 11]} matches. The scan range is [0, to infinity+] and [infinity-, 10], not just [0, 10].

I hope this helps clarify!
Carl

Comment by Glen Miner [ 07/May/20 ]

I think that would be inconsistent with out exact match queries work and doesn't address the fact that range queries match results they shouldn't. If it were up to me I'd fix range queries so they work like exact-match queries.

Comment by Carl Champain (Inactive) [ 07/May/20 ]

shaggie76, thank you for all the info!

We could convert this ticket into a new feature request ticket: an $elemMatch-like operator that also matches scalars. What do you think?

 

 

Comment by Glen Miner [ 06/May/20 ]

I realized something was up when I noticed that the brute force way didn't match the range query:

> var count = 0; for(var i = -656117760; i <= -656116737; ++i) { count += db.accounts.count( {IP:NumberInt(i)} ); } print(count);
49 
> db.accounts.count({IP:{$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}})
2971

So it occurred to me that exact match was missing array elements or range match might be returning results that don't belong (spoiler: it's returning results that don't belong).
 
I checked if it was counting all arrays blindly, but it turns out that isn't it:

> db.accounts.count({IP:{$type:4}})
23024

I checked elemMatch and it doesn't work for scalars (as expected)

> db.accounts.count({IP: {$elemMatch: {$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}}})
0

It does work for array elements of course:

> db.accounts.find({IP: {$elemMatch: {$gte:NumberInt(782650415),$lte:NumberInt(782650417)}}})
{ "_id" : ObjectId("xxx"), "IP" : [ 782650416, -1338277100 ] }

 
 The standard range-query works for scalar matches which makes sense:
 

> db.accounts.find({IP:{$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}},{IP:1}).limit(1).pretty()
{ "_id" : ObjectId("xxx"), "IP" : -656117739 }  

However the multi-key index also has array elements and the same query has matches that don't belong:

> db.accounts.find({IP:{$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}},{IP:1}).skip(1000).limit(1).pretty()
{
        "_id" : ObjectId("xxx"),
        "IP" : [
                782650416,
                -1338277100
        ]
}

What's surprising to me is that an exact match seems to work (and is what we generally do):

> db.accounts.find({IP:NumberInt(782650416)},{IP:1}).pretty()
{ "_id" : ObjectId("xxx"), "IP" : 782650416 }
{ "_id" : ObjectId("xxx"), "IP" : 782650416 }
{ "_id" : ObjectId("xxx"), "IP" : 782650416 }
{
        "_id" : ObjectId("xxx"),
        "IP" : [
                782650416,
                -1338277100
        ]
}
{ "_id" : ObjectId("xxx"), "IP" : 782650416 }
 

So exact match hits scalars and array elements alike (righteous!) but range queries are hitting array elements they shouldn't (bogus!)
 
I tried being explicit about this:

> db.accounts.count({
    $or : [
        {IP: {$not:{$type:4},$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}},
        {IP: {$elemMatch: {$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}}}
    ]
})
49

Which produces what I believe to be the correct count but my rough timing shows it takes about 10 seconds (again, 10x slower than the for-loop).
 
And just to make sure I'm not crazy, I made sure there's an index:

> db.accounts.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "index_test.accounts"
        },
        {
                "v" : 2,
                "key" : {
                        "IP" : 1
                },
                "name" : "IP_1",
                "ns" : "index_test.accounts",
                "sparse" : true
        }
]

In summary:

  1. Exact match and elemMatch queries work as expected
  2. Bounded range queries ($gte,$lte) return results for arrays that don't contain a matching element
  3. 1023 exact-match querie still run about 10x faster than a $or with elemMatch

 
  
 
 
-g

Comment by Carl Champain (Inactive) [ 06/May/20 ]

Hi shaggie76,

We think that using $elemMatch in your query should solve the performance issue:

 db.accounts.find({IP: {$elemMatch: {$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}}})

Let us know if it worked!
Carl

Comment by Glen Miner [ 05/May/20 ]

To avoid straining our production servers I ran a very slow mongoexport to extract just _id and IP and host it in a developer workstation to gather this information for you.

Having 99% of the document data removed does help considerably – roughly speaking the for-loop brute force query takes about 1 second and the single count query takes 22 seconds.

I will attach the data and log as requested – please let me know if I have missed something because now that I have a test bench I can run it again easily.

 

Comment by Carl Champain (Inactive) [ 04/May/20 ]

Hi shaggie76,

Thanks for the report.
Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) covering this issue? I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Kind regards,
Carl
 

Generated at Thu Feb 08 05:15:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.