-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
None
-
Query Execution
-
Minor Change
-
QE 2021-09-06, QE 2021-09-20, QE 2021-10-04, QE 2021-10-18, QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-01-24
PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII
characters are recognized, but if PCRE_UCP is set, Unicode properties
are used instead to classify characters. More details are given in the
section on generic character types in the pcrepattern page. If you set
PCRE_UCP, matching one of the items it affects takes much longer. The
option is available only if PCRE has been compiled with Unicode property
support.
Without this option, word-boundary assertions (\b, for example) do not behave correctly when a word starts with a non-ASCII UTF-8 character, because that character is not classified as a word character.
Adapted from https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/owqLT6b-weE
so@local(2.2.0) > db.subjects.find( { labelfr: /colo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
{ "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
Fetched 3 record(s) in 5ms
but
so@local(2.2.0) > db.subjects.find( { labelfr: /\bcolo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
Fetched 2 record(s) in 13ms

so@local(2.2.0) > db.subjects.find( { labelfr: /\Bcolo/ })
{ "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
Fetched 1 record(s) in 9ms

so@local(2.2.0) > db.subjects.find( { labelfr: /\BÉcolo/ })
{ "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
{ "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
Fetched 2 record(s) in 9ms

so@local(2.2.0) > db.subjects.find( { labelfr: /\bÉcolo/ })
Fetched 0 record(s) in 6ms
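The same classification problem can be reproduced outside the server. The following is only a sketch in plain JavaScript (Node.js), not the server's PCRE code path: JavaScript's \b is also ASCII-only, so it gives the same results as the queries above. The helper `matching` and the constant `W` are hypothetical names introduced for illustration, and the lookbehind with Unicode property escapes is an assumed approximation of what a Unicode-aware \b would do, not how PCRE_UCP is implemented.

```javascript
// Sample values from the report; \u00C9 is the precomposed "É".
const labels = ["\u00C9cologie", "word \u00C9cologie", "word ecologie"];

// Hypothetical helper: returns the labels that match the given regex.
function matching(re) {
  return labels.filter((s) => re.test(s));
}

// ASCII-only \b: "É" is treated as a non-word character, so a boundary
// is seen between "É" and "c", and no boundary is seen before "É".
const asciiColo = matching(/\bcolo/);
const asciiEcolo = matching(/\b\u00C9colo/);

// Unicode-aware emulation: treat letters, numbers, and "_" as word
// characters. Only the left side is checked, which is enough here
// because both patterns start with a letter.
const W = "[\\p{L}\\p{N}_]";
const unicodeColo = matching(new RegExp(`(?<!${W})colo`, "u"));
const unicodeEcolo = matching(new RegExp(`(?<!${W})\\u00C9colo`, "u"));

console.log({ asciiColo, asciiEcolo, unicodeColo, unicodeEcolo });
```

With ASCII-only classification, /\bcolo/ matches the two "Écologie" values and /\bÉcolo/ matches nothing, as in the report. With the Unicode-aware emulation the results are reversed, which is the behavior the ticket asks for.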
- is related to
-
SERVER-23881 allow regex word character (\w) and word boundary (\b) escapes to be unicode-aware
- Backlog