[SERVER-34933] pcre verb support Created: 10/May/18 Updated: 29/Oct/23 Resolved: 21/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.6.0, 4.0.0 |
| Fix Version/s: | 3.6.6, 4.0.1, 4.1.1 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Alexander Abolishin | Assignee: | Kyle Suarez |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Backport Requested: |
v4.0, v3.6
|
||||||||||||||||||||
| Sprint: | Query 2018-06-18, Query 2018-07-02 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
Docker container from mongo:3.4 image. Works as expected. }) Docker container from mongo:3.6 image. Throws error. }) > centos7 build of 3.6 from https://repo.mongodb.org/ yields the same error as docker image. |
| Comments |
| Comment by Githook User [ 29/Jun/18 ] | |||||||||||
|
Author: {'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}Message: (cherry picked from commit a2ee109e64923e0e569fa8adb0dbc67488a77983) | |||||||||||
| Comment by Githook User [ 29/Jun/18 ] | |||||||||||
|
Author: {'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}Message: Revert " There exist PCRE regexes whose error strings are nonempty but can still This reverts commit 1531b7b7280dd37d5f7ffd49171a65305ad442ba. (cherry picked from commit e9a6f62b3053733354cabb6d9488d0d4dcc6c424) | |||||||||||
| Comment by Alexander Abolishin [ 22/Jun/18 ] | |||||||||||
|
Thank you. | |||||||||||
| Comment by Kyle Suarez [ 21/Jun/18 ] | |||||||||||
|
Hi x86xlat, Thank you for reporting this bug! We've made adjustments to our regular expressions that should restore the functionality of the (*UCP) option. The fix will be included in the upcoming 3.6.6 release. Best, | |||||||||||
| Comment by Kyle Suarez [ 21/Jun/18 ] | |||||||||||
|
Author: {'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}Message: Revert " There exist PCRE regexes whose error strings are nonempty but can still This reverts commit 1531b7b7280dd37d5f7ffd49171a65305ad442ba. | |||||||||||
| Comment by Githook User [ 21/Jun/18 ] | |||||||||||
|
Author: {'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}Message: (cherry picked from commit a2ee109e64923e0e569fa8adb0dbc67488a77983) Conflicts: | |||||||||||
| Comment by Githook User [ 20/Jun/18 ] | |||||||||||
|
Author: {'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}Message: | |||||||||||
| Comment by Githook User [ 20/Jun/18 ] | |||||||||||
|
Author: {'username': 'ksuarz', 'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com'}Message: Revert " There exist PCRE regexes whose error strings are nonempty but can still This reverts commit 1531b7b7280dd37d5f7ffd49171a65305ad442ba. | |||||||||||
| Comment by Kyle Suarez [ 19/Jun/18 ] | |||||||||||
|
I've filed a bug report with BugZilla: https://bugs.exim.org/show_bug.cgi?id=2283 I tried to step through the pcrecpp library with LLDB, but the flow of control is confusing inside of pcre_compile.c and I kept getting useless output in frame variable. In my personal opinion, the "real fix" is to clear RE::error_ if RE::partial_ is set at the end of RE::Init(). However, after diving into the code, I feel like this may be risky, and I advocate for simply reverting | |||||||||||
| Comment by Kyle Suarez [ 01/Jun/18 ] | |||||||||||
I have a clarifying question – if it is treated as a "global one-time setting", does that mean that all regexes are treated as if the UCP verb was_always_ specified? Or is it the (in my opinion, expected) case that not specifying UCP treats non-ASCII characters as unknowns?
They say you can file a bug report on BugZilla. I took a cursory look at the existing filed bugs and didn't see it here: https://bugs.exim.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&product=PCRE | |||||||||||
| Comment by Asya Kamsky [ 01/Jun/18 ] | |||||||||||
|
We need to fix this.
However, had we not detected this error and aborted (which we didn't prior to 3.6) it turns out the (*UCP) verb is correctly being applied to the regex! What this means is that in 3.4 this worked correctly, and in 3.6 it now throws an error instead. This is because later they look for special prefix sequence (*...) including UCP which for some reason are referred to as "global one time settings" and not "verbs" (see https://github.com/mongodb/mongo/blob/993f30454af95b3d9c1a377b96d258849ac0fa46/src/third_party/pcre-8.41/pcre_compile.c#L9138-L9169). So we end up with situation where even though the error string was set to "VERB not recognized" it's because it's later recognized as a "global one time setting" and the regex match happens exactly as it should and we should not have aborted with an error. So maybe technically "PCRE library needs to fix this" but ... I won't hold my breath (maybe we can submit a PR after fixing this for our users though). | |||||||||||
| Comment by Asya Kamsky [ 18/May/18 ] | |||||||||||
|
Is this related to SERVER-7218 | |||||||||||
| Comment by Gabriel Russell (Inactive) [ 16/May/18 ] | |||||||||||
|
I doubt that this is related to the strtoll/strtoq change. I'll add this to the query backlog for them to triage. | |||||||||||
| Comment by Kyle Suarez [ 15/May/18 ] | |||||||||||
|
Hmmm, I'm not sure what's going on. I dug a little deeper after an offline discussion with gabriel.russell. Looking at our vendored copy of PCRE on r3.4.14 versus the current tip of master, the only difference is Next, I looked at the code in RegexMatchExpression. It underwent a major refactor as part of Lastly, I took a look at how we're compiling and building pcrecpp. The SConscript is the same in r3.4.14 versus master, and there are comments that suggest that Unicode properties support is enabled for all builders:
From my conversation with Gabriel, my understanding is that the ./configure call is done one time by some person upstream that converts the autotools style of building into our SConscript. Perhaps that needs to be looked at? gabriel.russell, do you think that we should assign this to the Platforms Team to investigate some pcrecpp build strangeness? Or is that unlikely, and we should keep this on the Query Team to dig further in our own mongo codebase? | |||||||||||
| Comment by Alexander Abolishin [ 11/May/18 ] | |||||||||||
|
Mongo 3.4 will match documents containing non ascii characters in field x.
Mongo 3.6 will not. Ucp turns on unicode properties for character types: | |||||||||||
| Comment by Kyle Suarez [ 10/May/18 ] | |||||||||||
|
Hi x86xlat, I believe "UCP" is not a valid PCRE verb. If you are intending on using it as a mark, you'll have to spell it as
or
Looking at the source code, the version of pcrecpp used in both r3.4.14 and in master is pcre-8.41. I think the difference is Could you please try this again on your data and verify that the updated $regex actually matches documents? Thanks, |