[SERVER-7218] Turn on PCRE_UCP config option to pcre build to enable some regex characters (\b \B \d etc) to work with UTF8 characters Created: 01/Oct/12 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Build, Querying |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Asya Kamsky | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 9 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||
| Backwards Compatibility: | Minor Change | ||||||||||||
| Sprint: | QE 2021-09-06, QE 2021-09-20, QE 2021-10-04, QE 2021-10-18, QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-01-24 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
PCRE_UCP: This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. Without this option, escapes such as the word boundary \b do not behave correctly when the word starts with a non-ASCII UTF-8 character. Adapted from https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/owqLT6b-weE
|
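The word-boundary problem in the description can be illustrated by analogy. The sketch below uses Python's built-in `re` module, which is not PCRE and does not support PCRE_UCP. It assumes only that `re.ASCII` roughly mirrors PCRE's default ASCII-only classification, and that Python's default Unicode mode roughly mirrors what PCRE_UCP enables. The sample text and pattern are hypothetical.

```python
import re

# Hypothetical sample: the target word starts with the non-ASCII character "é".
text = "un élan"
pattern = r"\bélan"

# ASCII-only classification (analogous to PCRE without UCP): "é" is not a
# word character, so there is no boundary between the space and "é".
ascii_match = re.search(pattern, text, re.ASCII)

# Unicode-aware classification (analogous to PCRE with UCP): "é" counts as a
# word character, so \b matches just before it.
unicode_match = re.search(pattern, text)

print(ascii_match)                # None
print(unicode_match is not None)  # True
```

The same pattern and input give different results depending only on how characters are classified, which is the behavior this ticket asks to change.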
| Comments |
| Comment by Jennifer Peshansky (Inactive) [ 14/Jul/22 ] | ||||||||||
|
After investigating the possibility of enabling PCRE2_UCP by default for all queries, we found that it has an unacceptable performance impact on every affected query. However, the option can be enabled for any specific pattern by beginning the pattern with (*UCP). The performance impact arises because, when UCP is enabled, PCRE2 (as well as the deprecated PCRE) must do a multistage table lookup to find the Unicode property of each character. The PCRE2 documentation does not recommend enabling this option by default. We hope the workaround is sufficient for most use cases. As there is no comparable workaround for turning UCP off once it's on, we have decided to leave the default behavior as is and document the (*UCP) option. You can track this work in | ||||||||||
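The (*UCP) workaround described above could be applied from client code along these lines. This is a minimal sketch, not an official helper: the function names `with_ucp` and `ucp_regex_filter` and the field name `title` are hypothetical, and the result is a plain dictionary in the shape of a `$regex` query filter, with no driver involved.

```python
UCP_VERB = "(*UCP)"


def with_ucp(pattern: str) -> str:
    """Return the pattern with (*UCP) at the very start (hypothetical helper)."""
    # The verb only takes effect at the beginning of the pattern,
    # so avoid adding it twice.
    if pattern.startswith(UCP_VERB):
        return pattern
    return UCP_VERB + pattern


def ucp_regex_filter(field: str, pattern: str) -> dict:
    """Build a $regex filter document whose pattern opts in to UCP."""
    return {field: {"$regex": with_ucp(pattern)}}


# Hypothetical usage: match documents whose "title" contains the word "élan".
query = ucp_regex_filter("title", r"\bélan")
print(query)
```

Opting in per pattern keeps the cost of the Unicode property lookup limited to the queries that need it, which matches the reasoning given for leaving the default unchanged.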
| Comment by Jennifer Peshansky (Inactive) [ 01/Oct/21 ] | ||||||||||
|
I've looked into whether this will still be an issue after we upgrade to PCRE2 and, unfortunately, it looks like it will be. The PCRE2 Unicode docs include almost exactly the same paragraph about \b, \B, etc. as the PCRE docs do, with no additional suggestions or workarounds given. | ||||||||||
| Comment by Ana Meza [ 10/Aug/21 ] | ||||||||||
|
Kyle, please investigate whether Dave's comment in SERVER-23881 is accurate. | ||||||||||
| Comment by Kyle Suarez [ 06/Aug/21 ] | ||||||||||
|
I'm sending this ticket back to the triage queue for consideration by Query Execution. | ||||||||||
| Comment by Tad Marshall [ 01/Oct/12 ] | ||||||||||
|
Code added here (in my build):
| ||||||||||
| Comment by Asya Kamsky [ 01/Oct/12 ] | ||||||||||
|
I built a version with PCRE_UCP turned on (via a change to matcher.cpp). Now it seems to do the right thing:
|