[SERVER-29865] Make regex use collation index Created: 26/Jun/17 Updated: 27/Oct/23 Resolved: 30/Jun/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | dane truelson | Assignee: | Backlog - Query Team (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Query
|
| Participants: |
| Description |
|
Case insensitive indexes should be used when running a case insensitive "starts with" regex like this /^searchterm/i Currently if you make a case insensitive collation and run a "starts with" regex query like this /^searchterm/i, mongodb will not use the index. This is a real problem for any application that uses an autocomplete feature. |
| Comments |
| Comment by Tess Avitabile (Inactive) [ 23/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||
|
Yes, please open a separate ticket. | ||||||||||||||||||||||||||||||||||||||
| Comment by David Mumladze [ 23/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||
|
Thanks for clarifying. That makes sense as collation order needs to be considered, which means this approach wouldn't be suitable for collections with data in multiple languages, so I'm abandoning this approach for now. I'm also trying another option, where all names are transformed to lowercase and pushed to an array, which is searched using the regex prefix:
Some observations
More on sorting.. Index is defined as:
and my query is:
.Sorting also didn't work with range query. Btw, if I create an auxiliary index as:
then that index is used and results in full scan. Should I open a separate case? | ||||||||||||||||||||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 23/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||
|
Hi David, When the prefix ends with 'z', you can "increment" previous letter of the prefix in the upper bound, for example:
If the prefix is all 'z's, then you can leave off the upper bound:
This assumes that all of the characters in the names are alphabetical. I am not certain why you are not observing "souza" >= "souz" && "souza" < "souª" in MongoDB, but I would guess it is because 'ª' does not follow 'z' in the collation that you are using. Best, | ||||||||||||||||||||||||||||||||||||||
| Comment by David Mumladze [ 23/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||
|
Hi Tess, Could you please explain how range approach may work for below scenario? There are last names that have 'z' in them, so if the user is searching for 'souz' prefix, then the next higher character after 'z' would be '{', which is not working, so I tried a char regarded as letter 'ª', but it still didn't work. Query came out as
Also, run this in any program and it'll return true, but I'm puzzled as to why MongoDB cannot do it.
Collection collation is
with corresponding index on lname field. Thanks, | ||||||||||||||||||||||||||||||||||||||
| Comment by dane truelson [ 01/Jul/17 ] | ||||||||||||||||||||||||||||||||||||||
|
It works! Thank you! Here is my code to add the proper search criteria to the find object in case anybody else wants it. I just made it today, so it is not guaranteed to be bug free.
| ||||||||||||||||||||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 30/Jun/17 ] | ||||||||||||||||||||||||||||||||||||||
|
Ah sorry I didn't point this out in my example. For the upper bound, you need to "increment" the last letter in the prefix. So in this query, you would use
| ||||||||||||||||||||||||||||||||||||||
| Comment by dane truelson [ 30/Jun/17 ] | ||||||||||||||||||||||||||||||||||||||
|
Hi Tess, Thanks so much for the reply. You all have done great work with collation support! I tried the workaround you suggested, and it did not work. I think it's because the word "john" is $gte "joh", but it is NOT $lte "joh". In other words, the full word is greater than the partial word, so the $lt part won't work. I did some testing and found this: db.people.find({firstName:{$gte:"joh"}}); now when I run Would you please double check the workaround and let me know if there is another way? | ||||||||||||||||||||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 30/Jun/17 ] | ||||||||||||||||||||||||||||||||||||||
|
Hello Dane, You raise a great question. This behavior is by design. In an index that uses collation, strings are stored as their ICU collation keys for that collation. So it seems as though for the regex /^searchterm\/i on an index with case-insensitive collation, we could use index bounds [ICUCollationKey("searchterm"), ICUCollationKey("searchtern")). However, while this would work in the English collation, it will not work for all collations. For example, for locales that sort from back to front, not all strings with the same prefix are sorted together, e.g. "ab" < "cb" < "ac". In general, the locale is allowed to define an arbitrary sort order over strings, so we cannot make assumptions as to how these sort orders relate to the binary comparison expected from regexes. As a workaround, you can write "starts with" regex queries using $gt an $lt, which can use a collation, i.e.
If the collation of the operation matches the collation of the index, it will be able to use that index. Best, |