[SERVER-18861] Queries matching null value should be fully covered by index Created: 08/Jun/15  Updated: 05/Dec/22

Status: Backlog
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Andrey Hohutkin Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 19
Labels: index-version, qopt-team, storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-12869 Index null values and missing values ... Backlog
Duplicate
is duplicated by SERVER-29326 count query with null criterion can't... Closed
is duplicated by SERVER-49845 allow indices to optionally cover mis... Closed
is duplicated by SERVER-36990 Indexed filed still exists in FETCH step Closed
Related
related to SERVER-5065 a missing field nested within an arra... Closed
related to SERVER-27646 covered index should be used when nul... Closed
related to SERVER-12869 Index null values and missing values ... Backlog
related to SERVER-18653 Answering "equality to null" predicat... Closed
is related to SERVER-55065 Null queries should be covered by ind... Closed
Assigned Teams:
Query Optimization
Sprint: Query Optimization 2021-02-22, Query Optimization 2021-03-08
Participants:
Case:

 Description   

Because of issue closing SERVER-18653 I open a feature request for the important improvement.

Assume, we have a collectoin with documents like

{}, {a:1}, {a:[]}, {a:null}

Also we have an non sparse index on "a" field.

If we query all documents where a doesnt exist or null with {a: null}, an engine always fetches documents even having field indexed.

Need to improve query planner to allow query null valued field and to build an index which can detect collisions with array values.



 Comments   
Comment by Alya Berciu [ 27/Oct/21 ]

fuat.sungur if the index is not multikey, we use a count scan (no fetch stage) for a count on an indexed field with {$eq: null} in 4.9.0 and above. So, for example, for a non-multikey index on field a, we can avoid a fetch stage for a query like the following:

db.c.count({a: {$eq: null}})

In this instance, instead of fetching the documents, we use two count scans joined by an OR stage to match both null and undefined.

Comment by Fuat Sungur [ 24/Oct/21 ]

alya.berciu, we have the similar problem with a customer that, $eq:null will cause unnecessary FETCH, however they are on version 4.4. 

Can we confirm that with the implementation of SERVER-55065, with the version 4.9.0; we are utilizing the index in the case $eq:null; without fetch stage, if there is no array value in the field? Then we can ask them to upgrade (they are on Atlas). 

Comment by Denys Isaiev [ 04/Oct/21 ]

Thanks, mongo upgrade indeed resolved our issue

Comment by Alya Berciu [ 16/Sep/21 ]

isaevdan@gmail.com the first option (for non-multikey indexes) is implemented in versions 4.9.0 and above (see SERVER-55065).

Comment by Denys Isaev [ 16/Sep/21 ]

Thanks for replies

So as I understand there are two options:

  • for not 'multikey' index is to implement optimization to not do fetch for count (as I understand this is not implemented yet? I tried to execute commands from my initial comment on brand new collection without inserting [] value, and still end up with fetch phase)
  • even if the index is 'multikey' implement optimization for predicate which includes both [] and null to not do fetch and thus give user a little bit of control about this (having in mind he knows whether there are cases with [] in his particular collection)

My question would be then if it is possible to implement any of the optimizations above? 

In general what is the recommendation to workaround such cases of filtering with null value?

Comment by Alya Berciu [ 15/Sep/21 ]

isaevdan@gmail.com , charlie.swanson is correct: in your first example, we don’t use a COUNT_SCAN stage, nor do we just use an IXSCAN stage without a FETCH, because the index is multikey and therefore cannot differentiate between the [], null, and missing values for the indexed field ‘Value’.

For your second example, the execution plan would be using COUNT_SCANs with no FETCH stage if your index is not multikey. Your index may be multikey if at some point a document in your collection had a 'Value' field that was an array.

To elaborate on Charlie's explanation, because the index itself stores null, missing, and [] in the same way, if we counted what the index scan returns for this predicate, we would also count empty arrays, which is incorrect. That’s why we need to fetch the document first, so we can see the actual value of ‘Value’, which based on the index would include any documents whose ‘Value’ field is either null, missing, or [], and then filter out documents where the ‘Value’ field is [].

Therefore we wouldn’t be able to eliminate the filter stage at all for this specific predicate on a multikey index, though we should be able to do it for a predicate which includes empty arrays (I don't think we currently do this, however).

Comment by Denys Isaev [ 15/Sep/21 ]

I didn't know that array value marks index as 'multikey', and added that values just to illustrate that points(2,4) from this comment do not apply https://jira.mongodb.org/browse/SERVER-18653?focusedCommentId=931817&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-931817 (I don't know why links is no visible in initial comment). In fact I do have only  2 cases

{ Value: null } { Value: \{ _id: someid and other date}

}

 

I don't even have cases without field Value in place. And when I try to count all records with { Value : null } I end up with fetch stage and slow query

So can this case be solved with engine optimizations?

Comment by Charlie Swanson [ 15/Sep/21 ]

isaevdan@gmail.com I'd be interested in alya.berciu's opinion as well, but it seems to me that we could not eliminate the fetch stage in your example due to the Value: [] document which makes the index "multikey". This "multikey" tag simply means there was at least one array value inserted, so there are potentially multiple index keys per document. Once the index is multikey, it is not eligible for a COUNT_SCAN because there could be multiple index entries for a single document which would incorrectly inflate the count.

In your example that is clearly not the case, but the system metadata does not currently distinguish between you inserting an empty array like you do here vs a full array like Value: [3, 4, 5] or Value: [null, null]. We would not want to count each of 3, 4, 5 separately nor would we want to count both of the nulls separately. See also SERVER-5580 which describes why we have to consider empty or singleton arrays as multikey even though they do not generate multiple keys.

We might be able to get rid of the filter on Value: {$eq: null} in this case, but I'm not optimistic that it would have a dramatic impact on query performance.

Comment by Denys Isaev [ 15/Sep/21 ]

Hi, I have a very specific case and went though all of the discussion around this topic and I think that this case might be optimized in engine without any new features. Please correct me if I'm wrong.

My case is that I have a following collection:

db.test.insert({Value: []})
db.test.insert({Value: null})
db.test.insert({Value: 1})
db.test.insert({})

and index

db.test.createIndex({Value: 1})

I execute query

db.test.count({Value: null})

From query plan I see that there is fetch stage with filter Value: {$eq : null}

As I understand there is no reason for this, since I don't need to know which case exactly is there(missing field or null value) for count operation. Also since this is a root level field points 2 and 4 from this comment do not apply here. If I'm right is that possible to implement optimization for count operation on root level fields? 

Comment by Alya Berciu [ 09/Mar/21 ]

CC anton.korshunov

Until the new index format is available, this issue cannot be fully fixed. In the meantime, we are using another ticket to track some tactical improvements: https://jira.mongodb.org/browse/SERVER-55065.

Comment by Charlie Swanson [ 23/Feb/21 ]

alya.berciu as we discussed during standup, this optimization may still make sense if the index is not "multikey" and has never seen any array values inserted. It becomes more fragile of an optimization, but I think still worthwhile.

Assigning this back to you to take another look to make sure that works.

Comment by Ronaldo Moura [ 11/Aug/20 ]

Any plans to solve this issue ?

Any recommend workaround on this issue ?

We have skip and count() commands using a: null, and they are loading all documents, ignoring the existent index

Tks.

Comment by Daniel Playfair Cal [ 09/Apr/20 ]

OK, perhaps only the possibility of a missing field is enough to cause the problem.

 

Let me take a step back and try to explain better what I want. I have a collection with documents matching the following schema:

{
  _id: ObjectId,
  a: ObjectId | null,
}

I want to query for documents where a is null. Note that I already know that `a` is either `null` or an ObjectId, even if Mongo doesn't know.

 

To do this I built the following index:

{
  a: 1,
  _id: 1,
}

If I do the simple query:

find({a: null}, {_id: 1})

Then there is an unecessary FETCH stage with a filter for `a: null`, even though the field `a` is not in the projection. Also, in circumstances where there are more fields and other indexes, mongo will sometimes choose to use an index which is much less ideal, which causes a much more serious performance problem than a non covered query.

So my question is, can I somehow change my query so that it matches the semantics of the index and it can be covered? I have chosen a projection that does not include the `a` field, so there should be no need for mongo to fetch anything from disk to return that field because I have not asked it to return that field at all.

For example, given that missing values are included in null index keys, I could amend my query to also accept missing values:

find({'$or': [{a: null}, {a: {'$exists': false}}]}, {_id: 1})

But this still results in a FETCH stage.

Based on this comment, mongo also creates null index keys for the empty array. So I attempted to amend my query again:

find({'$or': [{a: null}, {a: {'$exists': false}}, {a: []}]}, {_id: 1})

But again, there is still an unecessary FETCH stage with a filter for `a` similar to my query, and mongo still does not always prefer to use the index on `a` which is I know is best suited to the query.

 

Is is possible to create a query that allows mongo to cover it an index by explicitly filtering for all the values that might generate a null key in the index? If so, how can I find what these possible values are?

Comment by Asya Kamsky [ 03/Apr/20 ]

To summarize:

When collection has an index on a and two documents:

{_id:1}
{_id:2, a:null}

The query {{

{a:null}

}} with projection {{

{a:1, _id:0}

}} needs to match both the documents (which it can do from the index) but correct values for field a can not be determined from the index and the documents must be fetched.

Comment by Asya Kamsky [ 03/Apr/20 ]

> mismatch between the semantics of the query 
{a: null} and the index:
{a: null}

There is not a mismatch here - the issue is that both match null and missing field. However to return that field value, the document has to be examined.

In other words, if you query

 db.coll.find({a:null},{_id:0, a:1}) 

and there is an index on a we don't know whether to return

 { } 

or

 { a: null} 

only from the index, without examining the actual document.

Index can fully answer the query {a:null} but it cannot cover the projection for the field from the index alone.

Comment by Daniel Playfair Cal [ 03/Apr/20 ]

I have just done some experiments on the latest version (4.2.5) and so far I've failed to find a workaround. I understand that the issue occurs because there is a mismatch between the semantics of the query {a: null} and the index: {a: null}. Some workarounds I tried that did not work (under the assumption that the key a contains either null or a number):

 

querying in a way that matches the semantics of the index:

{a: '$in': [null, []]}

It's not clear to me why, but this still results in a FETCH+filter stage that filters even after the index is used to select documents indexes as {a: null}

 

Using a partial index:

createIndex({a: 1}, {partialFilterExpression: {a: null}})

or alternatively (with a matching query): createIndex({a: 1}, {partialFilterExpression: {a:

{'$type': 10}

}})

 

None of these workarounds work. Is there any alternative I'm not seeing here to use covered queries involving null values?

Comment by Andrey Hvetskovich [ 18/Sep/19 ]

Any plans on resolving this issue?

Comment by Kay Agahd [ 05/Jun/19 ]

so you may add "depends on SERVER-12869"

Comment by David Storch [ 05/Jun/19 ]

asya, the meaning of $eq:null predicates is to match documents where the path is either missing or contains a literal null value. Therefore, I don't think it makes sense to resolve this as a duplicate of SERVER-12869. They are related, but not precisely the same. Additionally, this ticket describes the problem whereas SERVER-12869 describes a particular set of code changes that wouldn't necessarily resolve this problem. I'm reopening this ticket.

Comment by Asya Kamsky [ 03/Jun/19 ]

This issue can only be resolved by implementing/resolving SERVER-12869 since current version of MongoDB indexes stores the same value for null and missing values, and to be able to cover queries that include $eq to null predicate clause and return that field, the index would have to store different values for the two.

Closing as a duplicate.

Comment by Ramon Fernandez Marina [ 09/Oct/15 ]

andrey.hohutkin@gmail.com, I realized we moved this ticket into "Needs Triage", but we never posted a public comment: this ticket will be considered in a future planning round. There's some ongoing internal discussion about semantics and their implications for existing apps. Updates will be posted to this ticket as they happen.

Regards,
Ramón.

Generated at Thu Feb 08 03:48:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.