[SERVER-7218] Turn on PCRE_UCP config option to pcre build to enable some regex characters (\b \B \d etc) to work with UTF8 characters Created: 01/Oct/12  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Build, Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Asya Kamsky Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 9
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-23881 allow regex word character (\w) and w... Backlog
is related to DOCS-15482 Document *UCP option to Enable PCRE2_... Closed
Assigned Teams:
Query Execution
Backwards Compatibility: Minor Change
Sprint: QE 2021-09-06, QE 2021-09-20, QE 2021-10-04, QE 2021-10-18, QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-01-24
Participants:

 Description   

http://www.pcre.org/pcre.txt

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII
characters are recognized, but if PCRE_UCP is set, Unicode properties
are used instead to classify characters. More details are given in the
section on generic character types in the pcrepattern page. If you set
PCRE_UCP, matching one of the items it affects takes much longer. The
option is available only if PCRE has been compiled with Unicode property
support.

Without this option, escapes such as \b (word boundary) do not behave correctly when a word starts with a non-ASCII UTF-8 character, because that character is not classified as a word character.

Adapted from https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/owqLT6b-weE

so@local(2.2.0) > db.subjects.find( { labelfr: /colo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
{ "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
Fetched 3 record(s) in 5ms

but

so@local(2.2.0) > db.subjects.find( { labelfr: /\bcolo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
Fetched 2 record(s) in 13ms
so@local(2.2.0) > db.subjects.find( { labelfr: /\Bcolo/ })
{ "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
Fetched 1 record(s) in 9ms
so@local(2.2.0) > db.subjects.find( { labelfr: /\BÉcolo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
Fetched 2 record(s) in 9ms
so@local(2.2.0) > db.subjects.find( { labelfr: /\bÉcolo/ })
Fetched 0 record(s) in 6ms
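
To see why the queries above return what they do, the following JavaScript sketch classifies characters the way \b does by default, treating only ASCII letters, digits, and underscore as word characters. It is an illustration only, not PCRE's implementation; isAsciiWordChar and asciiBoundaryAt are hypothetical helper names.

```javascript
// ASCII-only word test, mirroring the default (non-UCP) meaning of \w.
// Hypothetical helper, written for illustration.
function isAsciiWordChar(ch) {
  return ch !== undefined && /^[A-Za-z0-9_]$/.test(ch);
}

// True when \b would match at index i of s under ASCII-only rules:
// exactly one of the two neighbouring characters is a word character.
function asciiBoundaryAt(s, i) {
  return isAsciiWordChar(s[i - 1]) !== isAsciiWordChar(s[i]);
}

const label = "word \u00C9cologie"; // "word Écologie"
const eIndex = label.indexOf("\u00C9"); // position of É
const cIndex = eIndex + 1;              // position of the following c

console.log(asciiBoundaryAt(label, eIndex)); // false: space and É are both non-word
console.log(asciiBoundaryAt(label, cIndex)); // true: É is non-word, c is a word character
```

Under these rules É is a non-word character, so a boundary is seen between É and c (which is why /\bcolo/ matches "Écologie") and none is seen before É (which is why /\bÉcolo/ matches nothing).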



 Comments   
Comment by Jennifer Peshansky (Inactive) [ 14/Jul/22 ]

After investigating the possibility of enabling PCRE2_UCP by default for all queries, we have found that it has an unacceptable performance impact on all queries affected.

However, this option can be enabled for any specific pattern by beginning the pattern with (*UCP).

The performance impact arises because, when UCP is enabled, PCRE2 (as well as the deprecated PCRE) must do a multistage table lookup to find the Unicode property of each character. The PCRE2 documentation does not recommend enabling this option by default.

We hope the workaround is sufficient for most use cases. As there is no comparable workaround for turning UCP off once it's on, we have decided to leave the default behavior as is, and document the (*UCP) option. You can track this work in DOCS-15482.
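
A minimal sketch of the (*UCP) workaround described in this comment, assuming the subjects collection and labelfr field from the description and a server whose regex engine accepts the (*UCP) prefix. withUcp is a hypothetical helper name, not part of any MongoDB API.

```javascript
// Hypothetical helper: prefix a PCRE pattern string with (*UCP) so that
// \b, \w, \d and friends use Unicode properties for this pattern only.
function withUcp(pattern) {
  return "(*UCP)" + pattern;
}

// The pattern is passed as a string through $regex, since a JavaScript
// regex literal beginning with (*UCP) is not valid JavaScript regex syntax.
const filter = { labelfr: { $regex: withUcp("\\b\u00C9colo") } }; // \bÉcolo
console.log(JSON.stringify(filter));

// In the mongo shell, assuming the collection from the description:
// db.subjects.find(filter)
```

Because the prefix is applied per pattern, only queries that need Unicode-aware classification pay the lookup cost.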

Comment by Jennifer Peshansky (Inactive) [ 01/Oct/21 ]

I've looked into whether this will still be an issue after we upgrade to PCRE2, and, unfortunately, it looks like it will be. The PCRE2 Unicode docs include almost exactly the same paragraph about \b, \B, etc. that the PCRE docs do, with no additional suggestions or workarounds given.

Comment by Ana Meza [ 10/Aug/21 ]

Kyle, please investigate whether Dave's comment in SERVER-23881 is accurate.

Comment by Kyle Suarez [ 06/Aug/21 ]

I'm sending this ticket back to the triage queue for consideration by Query Execution.

Comment by Tad Marshall [ 01/Oct/12 ]

Code added here (in my build):

    inline pcrecpp::RE_Options flags2options(const char* flags) {
        pcrecpp::RE_Options options;
        options.set_all_options( options.all_options() | PCRE_UCP ); // this line added
        options.set_utf8(true);
        while ( flags && *flags ) {
 

Comment by Asya Kamsky [ 01/Oct/12 ]

I built a version with PCRE_UCP turned on by adding this line to matcher.cpp:

options.set_all_options( options.all_options() | PCRE_UCP );

Now it seems to do the right thing:

> db.subjects.find( { labelfr: /\bcolo/ })
> db.subjects.find( { labelfr: /\Bcolo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
{ "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
> db.subjects.find( { labelfr: /\BÉcolo/ })
> db.subjects.find( { labelfr: /\bÉcolo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
> 
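
The results in this comment are consistent with classifying word characters by Unicode property. The JavaScript sketch below approximates that by treating any letter (\p{L}), number (\p{N}), or underscore as a word character; this is an assumption about the classification made for illustration, not PCRE's code, and the helper names are hypothetical.

```javascript
// Unicode-aware word test, approximating \w with PCRE_UCP set.
// Hypothetical helper, written for illustration.
function isUnicodeWordChar(ch) {
  return ch !== undefined && /^[\p{L}\p{N}_]$/u.test(ch);
}

// True when \b would match at index i of the code point array.
function unicodeBoundaryAt(chars, i) {
  return isUnicodeWordChar(chars[i - 1]) !== isUnicodeWordChar(chars[i]);
}

const chars = Array.from("word \u00C9cologie"); // "word Écologie" as code points

console.log(unicodeBoundaryAt(chars, 5)); // true: space is non-word, É is a letter
console.log(unicodeBoundaryAt(chars, 6)); // false: É and c are both word characters
```

With É counted as a word character, the boundary falls before É instead of after it, which matches /\bÉcolo/ finding both "Écologie" documents and /\bcolo/ finding none.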

Generated at Thu Feb 08 03:13:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.