[SERVER-34933] pcre verb support Created: 10/May/18  Updated: 29/Oct/23  Resolved: 21/Jun/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.6.0, 4.0.0
Fix Version/s: 3.6.6, 4.0.1, 4.1.1

Type: Bug Priority: Critical - P2
Reporter: Alexander Abolishin Assignee: Kyle Suarez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
is caused by SERVER-30986 Server should error on invalid regex ... Closed
Related
related to SERVER-39697 Regex MatchExpression should error at... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Query 2018-06-18, Query 2018-07-02
Participants:

 Description   

Docker container from mongo:3.4 image. Works as expected.
$ docker run -itd --rm --name mongo.test mongo:3.4
$ docker exec -it mongo.test mongo
MongoDB shell version v3.4.14
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.4.14
... warnings ...
> db.test.find({ test:

{ $regex: '(*UCP)test' }

})
> // no error

Docker container from mongo:3.6 image. Throws error.
$ docker run -itd --rm --name mongo.test mongo:3.6
$ docker exec -it mongo.test mongo
MongoDB shell version v3.6.3
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.3
... warnings ...
> db.test.find({ test:

{ $regex: '(*UCP)test' }

})
Error: error:

{ "ok" : 0, "errmsg" : "Regular expression is invalid: (*VERB) not recognized or malformed", "code" : 2, "codeName" : "BadValue" }

>

centos7 build of 3.6 from https://repo.mongodb.org/ yields the same error as docker image.



 Comments   
Comment by Githook User [ 29/Jun/18 ]

Author:

{'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}

Message: SERVER-34933 tests for PCRE verbs and Unicode options

(cherry picked from commit a2ee109e64923e0e569fa8adb0dbc67488a77983)
Branch: v4.0
https://github.com/mongodb/mongo/commit/2c5827fbc7daffc5ca1bead13ab6d77ea0c5e4d9

Comment by Githook User [ 29/Jun/18 ]

Author:

{'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}

Message: Revert "SERVER-30986 Prevent invalid regex in RegexMatchExpressions"

There exist PCRE regexes whose error strings are nonempty but can still
be used for matching. This restores the functionality of the (*UCP)
option for SERVER-34933.

This reverts commit 1531b7b7280dd37d5f7ffd49171a65305ad442ba.
Conflicts:
src/mongo/db/matcher/expression_leaf.cpp
src/mongo/db/matcher/expression_leaf.h
src/mongo/db/matcher/expression_leaf_test.cpp

(cherry picked from commit e9a6f62b3053733354cabb6d9488d0d4dcc6c424)
Branch: v4.0
https://github.com/mongodb/mongo/commit/3cd8ab42a9401bebccbc2294d86149e71801f167

Comment by Alexander Abolishin [ 22/Jun/18 ]

Thank you.

Comment by Kyle Suarez [ 21/Jun/18 ]

Hi x86xlat,

Thank you for reporting this bug! We've made adjustments to our regular expressions that should restore the functionality of the (*UCP) option. The fix will be included in the upcoming 3.6.6 release.

Best,
Kyle

Comment by Kyle Suarez [ 21/Jun/18 ]

Author:

{'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}

Message: Revert "SERVER-30986 Prevent invalid regex in RegexMatchExpressions"

There exist PCRE regexes whose error strings are nonempty but can still
be used for matching. This restores the functionality of the (*UCP)
option for SERVER-34933.

This reverts commit 1531b7b7280dd37d5f7ffd49171a65305ad442ba.
Branch: v3.6
https://github.com/mongodb/mongo/commit/568086fa19039d2a879f2da602456e341bfed102

Comment by Githook User [ 21/Jun/18 ]

Author:

{'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}

Message: SERVER-34933 tests for PCRE verbs and Unicode options

(cherry picked from commit a2ee109e64923e0e569fa8adb0dbc67488a77983)

Conflicts:
jstests/core/regex_limit.js
jstests/core/regex_unicode.js
jstests/core/regex_verbs.js
src/mongo/db/matcher/expression_leaf_test.cpp
Branch: v3.6
https://github.com/mongodb/mongo/commit/ae59210aaff95de6acc04434de9788bf1a5f730a

Comment by Githook User [ 20/Jun/18 ]

Author:

{'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}

Message: SERVER-34933 tests for PCRE verbs and Unicode options
Branch: master
https://github.com/mongodb/mongo/commit/a2ee109e64923e0e569fa8adb0dbc67488a77983

Comment by Githook User [ 20/Jun/18 ]

Author:

{'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}

Message: Revert "SERVER-30986 Prevent invalid regex in RegexMatchExpressions"

There exist PCRE regexes whose error strings are nonempty but can still
be used for matching. This restores the functionality of the (*UCP)
option for SERVER-34933.

This reverts commit 1531b7b7280dd37d5f7ffd49171a65305ad442ba.
Conflicts:
src/mongo/db/matcher/expression_leaf.cpp
src/mongo/db/matcher/expression_leaf.h
src/mongo/db/matcher/expression_leaf_test.cpp
Branch: master
https://github.com/mongodb/mongo/commit/e9a6f62b3053733354cabb6d9488d0d4dcc6c424

Comment by Kyle Suarez [ 19/Jun/18 ]

I've filed a bug report with BugZilla: https://bugs.exim.org/show_bug.cgi?id=2283

I tried to step through the pcrecpp library with LLDB, but the flow of control is confusing inside of pcre_compile.c and I kept getting useless output in frame variable. In my personal opinion, the "real fix" is to clear RE::error_ if RE::partial_ is set at the end of RE::Init(). However, after diving into the code, I feel like this may be risky, and I advocate for simply reverting SERVER-30986. Depending on how the developers respond, we can try to get a patch upstreamed, but for now, we should at least give back to our users the ability to use the Unicode options.

Comment by Kyle Suarez [ 01/Jun/18 ]

So we end up with situation where even though the error string was set to "VERB not recognized" it's because it's later recognized as a "global one time setting" and the regex match happens exactly as it should and we should not have aborted with an error.

I have a clarifying question – if it is treated as a "global one-time setting", does that mean that all regexes are treated as if the UCP verb was_always_ specified? Or is it the (in my opinion, expected) case that not specifying UCP treats non-ASCII characters as unknowns?

maybe we can submit a PR after fixing this for our users though

They say you can file a bug report on BugZilla. I took a cursory look at the existing filed bugs and didn't see it here: https://bugs.exim.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&product=PCRE

Comment by Asya Kamsky [ 01/Jun/18 ]

We need to fix this.

SERVER-30986 checks the error string that PCRE library sets and sure enough in this case it's being set to `(*VERB) not recognized or malformed` however, this is because it's not found in this list (apparently).

However, had we not detected this error and aborted (which we didn't prior to 3.6) it turns out the (*UCP) verb is correctly being applied to the regex! What this means is that in 3.4 this worked correctly, and in 3.6 it now throws an error instead.

This is because later they look for special prefix sequence (*...) including UCP which for some reason are referred to as "global one time settings" and not "verbs" (see https://github.com/mongodb/mongo/blob/993f30454af95b3d9c1a377b96d258849ac0fa46/src/third_party/pcre-8.41/pcre_compile.c#L9138-L9169).

So we end up with situation where even though the error string was set to "VERB not recognized" it's because it's later recognized as a "global one time setting" and the regex match happens exactly as it should and we should not have aborted with an error.  

So maybe technically "PCRE library needs to fix this" but ... I won't hold my breath (maybe we can submit a PR after fixing this for our users though).

Comment by Asya Kamsky [ 18/May/18 ]

Is this related to SERVER-7218 

Comment by Gabriel Russell (Inactive) [ 16/May/18 ]

I doubt that this is related to the strtoll/strtoq change. I'll add this to the query backlog for them to triage.

Comment by Kyle Suarez [ 15/May/18 ]

Hmmm, I'm not sure what's going on. I dug a little deeper after an offline discussion with gabriel.russell. Looking at our vendored copy of PCRE on r3.4.14 versus the current tip of master, the only difference is SERVER-32391, which unsets HAVE_STRTOQ and instead sets HAVE_STRTOLL. I grepped around the PCRE code and there are no immediately obvious pieces of code that depend on both HAVE_STRTOQ and SUPPORT_UCP. The only uses of strtoq(3) are in parse_longlong_radix() and parse_ulonglong_radix(), and my understanding of that code is that it has nothing to do with UCP support.

Next, I looked at the code in RegexMatchExpression. It underwent a major refactor as part of SERVER-30783, but the actual initialization code that calls into pcrecpp is essentially the same in both r3.4.14 and the two constructors on the tip of master.

Lastly, I took a look at how we're compiling and building pcrecpp. The SConscript is the same in r3.4.14 versus master, and there are comments that suggest that Unicode properties support is enabled for all builders:

# Generated via
# AutoTools (non-Windows)
#  ./configure --disable-stack-for-recursion --enable-utf --enable-unicode-properties
#   --with-match-limit=200000 --with-match-limit-recursion=4000 --enable-shared=no
# CMake (Windows)
#  -DPCRE_SUPPORT_PCREGREP_JIT:BOOL="0" -DPCRE_BUILD_TESTS:BOOL="0"
#  -DPCRE_POSIX_MALLOC_THRESHOLD:STRING="10" -DPCRE_MATCH_LIMIT_RECURSION:STRING="4000"
#  -DPCRE_NO_RECURSE:BOOL="1" -DPCRE_LINK_SIZE:STRING="2" -DPCRE_NEWLINE:STRING="LF"
#  -DPCRE_SUPPORT_UNICODE_PROPERTIES:BOOL="1" -DPCREGREP_BUFSIZE:STRING="20480"
#  -DPCRE_MATCH_LIMIT:STRING="200000" -DPCRE_PARENS_NEST_LIMIT:STRING="250"
#  -DPCRE_SUPPORT_UTF:BOOL="1"

From my conversation with Gabriel, my understanding is that the ./configure call is done one time by some person upstream that converts the autotools style of building into our SConscript. Perhaps that needs to be looked at?

gabriel.russell, do you think that we should assign this to the Platforms Team to investigate some pcrecpp build strangeness? Or is that unlikely, and we should keep this on the Query Team to dig further in our own mongo codebase?

Comment by Alexander Abolishin [ 11/May/18 ]

Mongo 3.4 will match documents containing non ascii characters in field x.

db.coll.find({ x: { $regex: '(*UCP)\\w' } })

Mongo 3.6 will not.

Ucp turns on unicode properties for character types:

Comment by Kyle Suarez [ 10/May/18 ]

Hi x86xlat,

I believe "UCP" is not a valid PCRE verb. If you are intending on using it as a mark, you'll have to spell it as

db.coll.find({x: {$regex: "(*MARK:UCP)test"}})

or

db.coll.find({x: {$regex: "(*:UCP)test"}})

Looking at the source code, the version of pcrecpp used in both r3.4.14 and in master is pcre-8.41. I think the difference is SERVER-30986: prior to the resolution of that ticket, an invalid regex would silently match no documents. Now, the server correctly raises an error.

Could you please try this again on your data and verify that the updated $regex actually matches documents?

Thanks,
Kyle

Generated at Thu Feb 08 04:38:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.