[SERVER-43989] Regex query not matching correctly on compound index Created: 14/Oct/19  Updated: 28/Oct/19  Resolved: 28/Oct/19

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: 3.6.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dimitris Xalatsis Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File compound_regex.json     File compound_regex_restrict.json     File non-multikey.json     File single_field.json    
Issue Links:
Duplicate
duplicates SERVER-29967 The regex filter operation should be ... Backlog
Operating System: ALL
Steps To Reproduce:

*We have a *Measurements collection that consists of documents with the following form:
 

{
 source: <number>,
 sourceId: <string>,
 timestamp: <date>,
 data: { readingType: <string>, value: <number> }
}

Also we have a compound index with the following fields:

{
 "sourceId" : 1,
 "source" : 1,
 "timestamp" : 1,
 "data.readingType" : 1
}

 
Issuing a query with a direct match on the sourceId filter with the value 0013A200418A03F7, yields the expected results and the query completes in ms.

However if the query is run using a regex expression:

db.getCollection('Measurement').find({sourceId: /0013A200418A03F7/});

it takes about 50 seconds to complete on a ~6 million document database.
The query executionStats results file are attached as compound_regex.json

Running the query while restricting the start of the regular expression as this

db.getCollection('Measurement').find({sourceId: /^0013A200418A03F7/});

filters out the documents properly using the index and the query executes quickly.
The executionStats results are attached as compound_regex_restrict.json

Finally just for proof of concept, creating a single field index on the sourceId field, and running the same problematic query again

db.getCollection('Measurement').find({sourceId: /0013A200418A03F7/});

now properly executs as it should.
The executionStats results are attached as single_field.json

Participants:

 Description   

Hi  there,

The problem is that when running a reqex query on a field indexed by a compound index, the IXSCAN stage incorrectly passes all the documents in the collection onto the next FETCH stage, which has to filter them all over again, resulting in a really slow execution of a what had to be a very fast query. 

The strange this is that if we alter the regex query with a ^ or & Regular expression restricting character it works correctly.

Also, if you have a single index using the queried field, issuing the same regex query , works as expected.

Please note that the field is first on the compound index, so it should behave as a single one.

Example details given below as steps to reproduce

 



 Comments   
Comment by Eric Sedor [ 28/Oct/19 ]

No need to apologize d.halatsis@centaur.ag; Please watch that ticket for updates!

Comment by Dimitris Xalatsis [ 26/Oct/19 ]

Hi Eric, 

It turns out you are right about the behavior not being the same when using a non multi key index. I was accidentally looking into my previous executionStats results, where the number of keys in IXSCAN stage returned was the same as the total count of docs in the collection...

I am really sorry for the misreport in my previous comment.

So it seems, it is the same related issue! 

Comment by Eric Sedor [ 25/Oct/19 ]

Happy to help, d.halatsis@centaur.ag,

Unfortunately, the cause of SERVER-29967 is that deduplication is performed in the IXSCAN stage before it applies the covering filter. This deduplication is required in multikey indexes even though the query does not include a filter on the field(s) that made the index multikey.

We can see in non-multikey.json that the filter is being applied in the IXSCAN stage, in contrast to compound_regex.json, where it is being applied in the FETCH stage due to SERVER-29967.

It is certainly true that anchoring a regex is a less time and CPU-intensive operation, so we would expect an unrooted regex to be slower. Can you go into more detail on what you mean by "properly uses the index" if you are seeing something specific?

Comment by Dimitris Xalatsis [ 23/Oct/19 ]

Hi @Eric Sedor, thanks for taking the time to look into this issue.

I saw the issue about the regex query on multikey index but I do not think this is the same issue.

First of all, you are correct about the field data. It is an array containing embedded documents of the form:

{  sensorId: <string>, readingType: <string>, value: <number> }

However, I am not using any of the embedded array document fields in my query. I am only querying on sourceId which is a single string field, so it should not interfere with any of the multi key indexed fields.

Secondly, please note that if I send a regex query defining the start of the regex with this

/^0013A200418A03F7/

it properly uses the index. *The only diference being the ^* character defining the start.

Also to prove the case, I created in a staging copy database an index containing only the non-multikey fields and issued the same query again. The same behavior is observed. So it does not seem to be a matter of the multikey indexes, but a compound vs single key case conflict as I initially stated in the issue.

You can find the executionStats results file  here: non-multikey.json

So I believe it is still valid and not duplicate of the mentioned 29967 issue

Comment by Eric Sedor [ 23/Oct/19 ]

Hi d.halatsis@centaur.ag, thanks for your patience.

The provided explain plans show the index to be a multikey index (isMultiKey: true), and we have SERVER-29967 open to track the optimization of regex performance in this situation. You are right that an IXSCAN could theoretically be used.

That said, the example document you provided doesn't include an array. If it happens to be the case that any data.readingType arrays are unintentional (or could be stored as separate documents without arrays), then you may be able to get improved performance without waiting on SERVER-29967.

In any case, please watch SERVER-29967 for updates.

Gratefully,
Eric

Generated at Thu Feb 08 05:04:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.