[SERVER-23881] allow regex word character (\w) and word boundary (\b) escapes to be unicode-aware
Created: 22/Apr/16 Updated: 06/Dec/22
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.0.11 |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Nic Cottrell (Personal) | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Query Optimization |
| Description |
| Comments |
| Comment by Asya Kamsky [ 16/Jul/21 ] |
I think this user ran into the same issue: https://www.mongodb.com/community/forums/t/regex-whole-word-match-not-working-for-vietnamese-language/114780/4
| Comment by Nic Cottrell (Personal) [ 27/Apr/16 ] |
It seems that this will work for my case:
| Comment by Nic Cottrell (Personal) [ 27/Apr/16 ] |
Thanks, I just read up on this and understand now. It's a bit of a shock, since I'm used to Java's engine, which seems to be the exception.
| Comment by David Storch [ 27/Apr/16 ] |
Hi niccottrell, just to clarify: I believe that we do build PCRE with UTF-8 support enabled. It appears that the default behavior of the escapes \b and \w, among others, is simply not changed when unicode support is enabled.
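One possible way to approximate a Unicode-aware word boundary when \b and \w stay ASCII-only is to spell out the letter class in lookarounds. The sketch below uses Python's `re` module in `re.ASCII` mode as a stand-in for an engine with ASCII-only escapes; it is not MongoDB or PCRE code. The `WORD_CHARS` class and the `whole_word` helper are hypothetical names introduced for illustration, and the class only covers ASCII plus the Latin-1 letters. Whether the same pattern behaves identically in the server's PCRE build is an assumption that has not been verified here.

```python
import re

# Hypothetical letter class: ASCII word characters plus the Latin-1 letters.
# Illustrative only; it does not cover all Unicode letters.
WORD_CHARS = r"A-Za-z0-9_À-ÖØ-öø-ÿ"


def whole_word(word: str) -> re.Pattern:
    """Build a pattern that matches `word` only when it is not adjacent
    to a character from WORD_CHARS (an explicit stand-in for \\b)."""
    return re.compile(
        rf"(?<![{WORD_CHARS}]){re.escape(word)}(?![{WORD_CHARS}])",
        flags=re.ASCII,  # keep \w and \b ASCII-only, as in the reported case
    )


pattern = whole_word("d")
inside_word = pattern.search("rød")    # "d" follows ø, which is in WORD_CHARS
standalone = pattern.search("a d b")   # "d" is surrounded by spaces

print(inside_word is not None, standalone is not None)
```

The explicit class trades generality for predictability: it only recognizes the characters listed in it, so it would need to be extended for other scripts.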
| Comment by Nic Cottrell (Personal) [ 27/Apr/16 ] |
Thanks for the details. Strange that there's not a way to flip PCRE into Unicode mode. Most regex platforms/libraries I know allow a "u" Unicode flag to force behaviour like this. It feels a bit wrong for Mongo to not include regex support for at least other European languages out of the box.
| Comment by David Storch [ 26/Apr/16 ] |
Hi niccottrell, Thanks for reporting this issue! You are absolutely correct that PCRE considers the Danish character ø to be a non-word character. This can be seen more clearly in the following example:
This is why ø is forming a word boundary. The PCRE manual has the following to say on the subject:
It elaborates:
So, to summarize, this is the expected behavior of PCRE. Unfortunately, as a user of MongoDB, there is no way to make use of the PCRE_UCP option or to otherwise coerce PCRE to use the unicode-aware regular expression semantics you desire. Therefore, we will keep this ticket open as a feature request. I am going to move it into Needs Triage state, and our development team will evaluate it at our next triage meeting. Please continue to watch for updates. Best,
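As a rough illustration of the behavior described in this comment, the sketch below contrasts ASCII-only and Unicode-aware handling of \w and \b. It uses Python's `re` module, where the `re.ASCII` flag restricts these escapes to ASCII, as an analogy for PCRE without PCRE_UCP. It is not MongoDB or PCRE code, and the sample string is made up for the demonstration.

```python
import re

text = "rød grød"

# ASCII-only classes: ø is treated as a non-word character,
# so each word is split around it.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware classes (Python's default for str patterns):
# ø counts as a word character.
unicode_words = re.findall(r"\w+", text)

# \bd\b looks for a "d" with a word boundary on both sides.
ascii_hit = re.search(r"\bd\b", text, flags=re.ASCII)
unicode_hit = re.search(r"\bd\b", text)

print(ascii_words)    # ['r', 'd', 'gr', 'd']
print(unicode_words)  # ['rød', 'grød']
print(ascii_hit is not None, unicode_hit is not None)  # True False
```

In ASCII mode the position between ø and d counts as a word boundary, which is the same effect reported in this ticket.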