[SERVER-23881] allow regex word character (\w) and word boundary (\b) escapes to be unicode-aware Created: 22/Apr/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 3.0.11
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Nic Cottrell (Personal) Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-7218 Turn on PCRE_UCP config option to pcr... Backlog
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

Provide a way to use regular expressions in MongoDB where the word character (\w) and word boundary (\b) escapes work for code points greater than or equal to 256.

Original description

$regex word boundary fails by treating the Danish character ø as a non-word character

db.collection.find({ "name" : { "$regex" : ".*\\bden\\b.*" , "$options" : "i"} })

returns a document:

{  "name": "Death Is A Caress(Døden Er Et Kjærtegn).sub" }
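The match can be reproduced outside the server. The following is a minimal sketch in plain JavaScript (for example under Node.js), not MongoDB's PCRE engine. It assumes only that JavaScript's \w and \b are ASCII-only by default, which mirrors the PCRE default discussed in the comments; the variable names are illustrative.

```javascript
// Sketch only: JavaScript's RegExp, not PCRE. Both treat \w as
// [A-Za-z0-9_] by default, so "ø" counts as a non-word character.
const name = "Death Is A Caress(Døden Er Et Kjærtegn).sub";
const pattern = /\bden\b/i;

const hit = pattern.exec(name);

// Because "ø" is not a \w character, \b matches between "ø" and "d",
// and "den" inside "Døden" is reported as a whole word.
const matchedText = hit ? hit[0] : null;
const precedingChar = hit ? name.charAt(hit.index - 1) : null;

console.log(matchedText, precedingChar);
```

The character immediately before the match is "ø", which is why a reader who considers "Døden" a single word sees a false positive.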



 Comments   
Comment by Asya Kamsky [ 16/Jul/21 ]

I think this user ran into the same issue: https://www.mongodb.com/community/forums/t/regex-whole-word-match-not-working-for-vietnamese-language/114780/4

Comment by Nic Cottrell (Personal) [ 27/Apr/16 ]

It seems that this will work for my case:

{ "name" : { "$regex" : "(^|.*\\s+)den(\\s+.*|$)" , "$options" : "i"} }
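Note that this whitespace-based pattern only accepts "den" when it is delimited by whitespace or by the start or end of the string, so it misses occurrences next to punctuation. A hedged alternative is to emulate a Unicode-aware word boundary with lookarounds over Unicode property classes. The sketch below runs in plain JavaScript (Node.js), not in MongoDB. Whether the same pattern is accepted by a given MongoDB build depends on PCRE having been compiled with Unicode property support, which is an untested assumption here. The names whitespaceOnly, letterAware, and samples are illustrative.

```javascript
// Sketch only: evaluated by JavaScript's RegExp engine, not by MongoDB/PCRE.
// The workaround from the comment above: whitespace (or string edge) on both sides.
const whitespaceOnly = /(^|.*\s+)den(\s+.*|$)/i;

// Hypothetical alternative: "den" not preceded or followed by a letter,
// a digit, or an underscore, using Unicode property classes.
const letterAware = /(?<![\p{L}\p{N}_])den(?![\p{L}\p{N}_])/iu;

const samples = [
  "Death Is A Caress(Døden Er Et Kjærtegn).sub", // "den" inside "Døden"
  "Den lille havfrue",                           // whole word at the start
  "Hvor er den?",                                // whole word before punctuation
  "(den)"                                        // whole word inside parentheses
];

const results = samples.map((s) => ({
  text: s,
  whitespaceOnly: whitespaceOnly.test(s),
  letterAware: letterAware.test(s)
}));

console.log(results);
```

Both patterns reject "Døden", but only the property-based one accepts "den" when it is followed by "?" or wrapped in parentheses.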

Comment by Nic Cottrell (Personal) [ 27/Apr/16 ]

Thanks - I just read up and understand. It's a bit of a shock, since I'm used to Java's engine, which seems to be the exception.

Comment by David Storch [ 27/Apr/16 ]

Hi niccottrell, just to clarify: I believe that we do build PCRE with UTF-8 support enabled. It appears that the default behavior of the escapes \b and \w, among others, simply does not change when Unicode support is enabled.

Comment by Nic Cottrell (Personal) [ 27/Apr/16 ]

Thanks for the details. Strange that there's not a way to flip PCRE into Unicode mode. Most regex platforms/libraries I know allow a "u" Unicode flag to force behaviour like this. It feels a bit wrong for Mongo to not include regex support for at least other European languages out of the box.

Comment by David Storch [ 26/Apr/16 ]

Hi niccottrell,

Thanks for reporting this issue! You are absolutely correct that PCRE considers the Danish character ø to be a non-word character. This can be seen more clearly in the following example:

> db.c.drop();
> db.c.insert({test: "ø"});
> db.c.insert({test: "a"});
> db.c.find({test: /\w/});
{ "_id" : ObjectId("571fe12e5bf4a8f145edf56c"), "test" : "a" }
> db.c.find({test: /\W/});
{ "_id" : ObjectId("571fe12b5bf4a8f145edf56b"), "test" : "ø" }

This is why ø is forming a word boundary. The PCRE manual has the following to say on the subject:

PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character code point. When running in UTF-8 mode, or in the 16- or 32-bit libraries, this applies only to characters with code points less than 256. By default, higher-valued code points never match escapes such as \w or \d. However, if PCRE is built with Unicode property support, all characters can be tested with \p and \P, or, alternatively, the PCRE_UCP option can be set when a pattern is compiled; this causes \w and friends to use Unicode property support instead of the built-in tables.

It elaborates:

The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of any code value, but, by default, the characters that PCRE recognizes as digits, spaces, or word characters remain the same set as in non-UTF mode, all with values less than 256. This remains true even when PCRE is built to include Unicode property support, because to do otherwise would slow down PCRE in many common cases. Note in particular that this applies to \b and \B, because they are defined in terms of \w and \W. If you really want to test for a wider sense of, say, "digit", you can use explicit Unicode property tests such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the character escapes work is changed so that Unicode properties are used to determine which characters match. There are more details in the section on generic character types in the pcrepattern documentation.
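The difference between the shorthand escapes and explicit Unicode property tests can be illustrated with a small sketch. It uses JavaScript's RegExp (Node.js) rather than PCRE, on the assumption that JavaScript's \d and \w are likewise restricted to ASCII by default and that \p{...} is available with the u flag; the variable names are illustrative.

```javascript
// Sketch only: JavaScript's RegExp, used to illustrate the distinction
// between ASCII-only shorthand escapes and Unicode property tests.
const arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE, category Nd
const danishO = "\u00F8";          // LATIN SMALL LETTER O WITH STROKE ("ø")

// Shorthand escapes cover only ASCII digits and word characters.
const shorthandDigit = /\d/.test(arabicIndicThree);
const shorthandWord = /\w/.test(danishO);

// Explicit property tests classify by Unicode general category.
const propertyDigit = /\p{Nd}/u.test(arabicIndicThree);
const propertyLetter = /\p{L}/u.test(danishO);

console.log({ shorthandDigit, shorthandWord, propertyDigit, propertyLetter });
```

The shorthand tests evaluate to false and the property tests to true, which is the "wider sense" of digit and letter that the manual describes.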

So, to summarize, this is the expected behavior of PCRE. Unfortunately, as a user of MongoDB, there is no way to make use of the PCRE_UCP option or to otherwise coerce PCRE to use the Unicode-aware regular expression semantics you desire. Therefore, we will keep this ticket open as a feature request. I am going to move it into the Needs Triage state, and our development team will evaluate it at our next triage meeting. Please continue to watch for updates.

Best,
Dave

Generated at Thu Feb 08 04:04:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.