[SERVER-43989] Regex query not matching correctly on compound index Created: 14/Oct/19 Updated: 28/Oct/19 Resolved: 28/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Querying |
| Affects Version/s: | 3.6.14 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dimitris Xalatsis | Assignee: | Eric Sedor |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
|||||||||||||||
| Issue Links: |
|
|||||||||||||||
| Operating System: | ALL | |||||||||||||||
| Steps To Reproduce: | *We have a *Measurements collection that consists of documents with the following form:
Also we have a compound index with the following fields:
However if the query is run using a regex expression:
it takes about 50 seconds to complete on a ~6 million document database. Running the query while restricting the start of the regular expression as this
filters out the documents properly using the index and the query executes quickly. Finally just for proof of concept, creating a single field index on the sourceId field, and running the same problematic query again
now properly executs as it should. |
|||||||||||||||
| Participants: | ||||||||||||||||
| Description |
|
Hi there, The problem is that when running a reqex query on a field indexed by a compound index, the IXSCAN stage incorrectly passes all the documents in the collection onto the next FETCH stage, which has to filter them all over again, resulting in a really slow execution of a what had to be a very fast query. The strange this is that if we alter the regex query with a ^ or & Regular expression restricting character it works correctly. Also, if you have a single index using the queried field, issuing the same regex query , works as expected. Please note that the field is first on the compound index, so it should behave as a single one. Example details given below as steps to reproduce
|
| Comments |
| Comment by Eric Sedor [ 28/Oct/19 ] | ||
|
No need to apologize d.halatsis@centaur.ag; Please watch that ticket for updates! | ||
| Comment by Dimitris Xalatsis [ 26/Oct/19 ] | ||
|
Hi Eric, It turns out you are right about the behavior not being the same when using a non multi key index. I was accidentally looking into my previous executionStats results, where the number of keys in IXSCAN stage returned was the same as the total count of docs in the collection... I am really sorry for the misreport in my previous comment. So it seems, it is the same related issue! | ||
| Comment by Eric Sedor [ 25/Oct/19 ] | ||
|
Happy to help, d.halatsis@centaur.ag, Unfortunately, the cause of SERVER-29967 is that deduplication is performed in the IXSCAN stage before it applies the covering filter. This deduplication is required in multikey indexes even though the query does not include a filter on the field(s) that made the index multikey. We can see in non-multikey.json It is certainly true that anchoring a regex is a less time and CPU-intensive operation, so we would expect an unrooted regex to be slower. Can you go into more detail on what you mean by "properly uses the index" if you are seeing something specific? | ||
| Comment by Dimitris Xalatsis [ 23/Oct/19 ] | ||
|
Hi @Eric Sedor, thanks for taking the time to look into this issue. I saw the issue about the regex query on multikey index but I do not think this is the same issue. First of all, you are correct about the field data. It is an array containing embedded documents of the form:
However, I am not using any of the embedded array document fields in my query. I am only querying on sourceId which is a single string field, so it should not interfere with any of the multi key indexed fields. Secondly, please note that if I send a regex query defining the start of the regex with this
it properly uses the index. *The only diference being the ^* character defining the start. Also to prove the case, I created in a staging copy database an index containing only the non-multikey fields and issued the same query again. The same behavior is observed. So it does not seem to be a matter of the multikey indexes, but a compound vs single key case conflict as I initially stated in the issue. You can find the executionStats results file here: non-multikey.json So I believe it is still valid and not duplicate of the mentioned 29967 issue | ||
| Comment by Eric Sedor [ 23/Oct/19 ] | ||
|
Hi d.halatsis@centaur.ag, thanks for your patience. The provided explain plans show the index to be a multikey index (isMultiKey: true), and we have SERVER-29967 open to track the optimization of regex performance in this situation. You are right that an IXSCAN could theoretically be used. That said, the example document you provided doesn't include an array. If it happens to be the case that any data.readingType arrays are unintentional (or could be stored as separate documents without arrays), then you may be able to get improved performance without waiting on SERVER-29967. In any case, please watch SERVER-29967 for updates. Gratefully, |