Core Server / SERVER-7218

Turn on the PCRE_UCP option in the PCRE build so that some regex escapes (\b, \B, \d, etc.) work with non-ASCII UTF-8 characters

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Build, Querying
    • Labels: None
    • Query Execution
    • Minor Change
    • QE 2021-09-06, QE 2021-09-20, QE 2021-10-04, QE 2021-10-18, QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-01-24

      http://www.pcre.org/pcre.txt

      PCRE_UCP

      This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
      \w, and some of the POSIX character classes. By default, only ASCII
      characters are recognized, but if PCRE_UCP is set, Unicode properties
      are used instead to classify characters. More details are given in the
      section on generic character types in the pcrepattern page. If you set
      PCRE_UCP, matching one of the items it affects takes much longer. The
      option is available only if PCRE has been compiled with Unicode
      property support.

      Without this option, escapes that match a word boundary (\b, for example) do not behave correctly when the word starts with a non-ASCII UTF-8 character, because that character is not classified as a word character.
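
      The difference between the two classification modes can be illustrated outside the server. The following is a minimal sketch in plain JavaScript, not MongoDB or PCRE itself: JavaScript's `\w` and `\d` are ASCII-only, much like PCRE without PCRE_UCP, while the property-based classes below are an assumed approximation of what PCRE_UCP substitutes. The names `classify` and `samples` are hypothetical.

```javascript
// ASCII-only classes: JavaScript's \w and \d cover only [A-Za-z0-9_] and [0-9].
const asciiWord = /^\w$/;
const asciiDigit = /^\d$/;

// Assumption: Unicode property classes standing in for the PCRE_UCP behaviour
// (letters, numbers and underscore for \w; decimal digits for \d).
const unicodeWord = /^[\p{L}\p{N}_]$/u;
const unicodeDigit = /^\p{Nd}$/u;

// Classify a single character under both rule sets.
function classify(ch) {
  return {
    ch,
    asciiWord: asciiWord.test(ch),
    unicodeWord: unicodeWord.test(ch),
    asciiDigit: asciiDigit.test(ch),
    unicodeDigit: unicodeDigit.test(ch),
  };
}

// "e", precomposed "É" (U+00C9), "7", and ARABIC-INDIC DIGIT THREE (U+0663).
const samples = ["e", "\u00C9", "7", "\u0663"];
for (const ch of samples) {
  console.log(classify(ch));
}
```

      Under the ASCII-only rules, "É" is not a word character, which is why a `\b` placed before the following letter reports a boundary.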

      Adapted from https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/owqLT6b-weE

      so@local(2.2.0) > db.subjects.find( { labelfr: /colo/ })
      { "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
      { "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
      { "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
      Fetched 3 record(s) in 5ms
      

      but adding a word-boundary escape gives results that treat "É" as a non-word character:

      so@local(2.2.0) > db.subjects.find( { labelfr: /\bcolo/ })
      { "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
      { "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
      Fetched 2 record(s) in 13ms
      so@local(2.2.0) > db.subjects.find( { labelfr: /\Bcolo/ })
      { "_id" : ObjectId("5069bcd7b049b18f5c52d1af"), "labelfr" : "word ecologie" }
      Fetched 1 record(s) in 9ms
      so@local(2.2.0) > db.subjects.find( { labelfr: /\BÉcolo/ })
      { "_id" : ObjectId("5069baa4b049b18f5c52d1ac"), "labelfr" : "Écologie" }
      { "_id" : ObjectId("5069bb78b049b18f5c52d1ae"), "labelfr" : "word Écologie" }
      Fetched 2 record(s) in 9ms
      so@local(2.2.0) > db.subjects.find( { labelfr: /\bÉcolo/ })
      Fetched 0 record(s) in 6ms
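
      The `\b` results above can be reproduced without a server. This is a rough sketch in plain JavaScript, whose `\b` is also ASCII-only; the look-behind pattern is an assumed stand-in for a Unicode-aware "start of word" check, not the PCRE_UCP implementation, and it covers only the leading side of the boundary. The names `select` and `NOT_AFTER_WORD` are hypothetical.

```javascript
// Sample values from the transcript; "\u00C9" is the precomposed "É".
const labels = ["\u00C9cologie", "word \u00C9cologie", "word ecologie"];

// Return the labels that a given regex matches.
const select = (re) => labels.filter((s) => re.test(s));

// ASCII-only \b: "É" is a non-word character, so a boundary is reported
// between "É" and "c", and none is reported before "É" itself.
const asciiColo = select(/\bcolo/);
const asciiEcolo = select(/\b\u00C9colo/);

// Assumption: emulate a Unicode-aware word start with a negative look-behind
// on letters, numbers and underscore.
const NOT_AFTER_WORD = "(?<![\\p{L}\\p{N}_])";
const unicodeColo = select(new RegExp(NOT_AFTER_WORD + "colo", "u"));
const unicodeEcolo = select(new RegExp(NOT_AFTER_WORD + "\u00C9colo", "u"));

console.log({ asciiColo, asciiEcolo, unicodeColo, unicodeEcolo });
```

      With the ASCII-only rules, `asciiColo` selects the two "Écologie" labels and `asciiEcolo` selects nothing, matching the transcript. With the Unicode-aware emulation the results are reversed, which is the behaviour this ticket asks for.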
      

            Assignee: backlog-query-execution [DO NOT USE] Backlog - Query Execution
            Reporter: asya.kamsky@mongodb.com Asya Kamsky
            Votes: 9
            Watchers: 14
