[SERVER-26991] Inconsistent handling of RegEx options Created: 10/Nov/16 Updated: 08/Oct/21 Resolved: 07/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | JavaScript, Querying, Shell, Tools |
| Affects Version/s: | None |
| Fix Version/s: | 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Hannes Magnusson | Assignee: | Mickey Winters |
| Resolution: | Done | Votes: | 0 |
| Labels: | sbe-diff, sbe-post-v1, sbe-rollout, sp-shell |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Minor Change |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Query Execution 2021-06-28, Query Execution 2021-07-12 |
| Participants: | |
| Description |
|
The server documents certain regex modifiers it supports: imxs
It is actually inconsistent within itself about which options it supports:
These modifiers are inconsistent with the BSON regex modifiers: imxlsu
The shell, however, only allows and supports: gimy
The server tools allow and support yet another set: gims |
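To make the mismatch concrete, the four option sets listed in the description can be compared directly. This is only an illustrative Python sketch: the dictionary labels and helper names are made up here and do not come from any MongoDB codebase.

```python
# Option sets exactly as listed in the description; the labels are illustrative.
OPTION_SETS = {
    "server_docs": set("imxs"),
    "bson_spec": set("imxlsu"),
    "shell": set("gimy"),
    "tools": set("gims"),
}


def common_options(option_sets):
    """Return the options accepted by every component."""
    return set.intersection(*option_sets.values())


def contested_options(option_sets):
    """Return the options accepted by at least one component but not by all."""
    return set.union(*option_sets.values()) - common_options(option_sets)


print(sorted(common_options(OPTION_SETS)))     # ['i', 'm']
print(sorted(contested_options(OPTION_SETS)))  # ['g', 'l', 's', 'u', 'x', 'y']
```

Only `i` and `m` are accepted everywhere, so any other option can behave differently depending on which component handles the regex.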
| Comments |
| Comment by Githook User [ 02/Jul/21 ] | |||||
|
Author: {'name': 'Mickey. J Winters', 'email': 'mickey.winters@mongodb.com', 'username': 'mjrb'} Message: | |||||
| Comment by Bernie Hackett [ 07/Jun/21 ] | |||||
|
There's a possibly Python-specific wrinkle here. In Python 3 all text is Unicode, and regular expressions have the 'u' flag set by default.
| |||||
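The default can be checked with the standard `re` module alone. The sketch below assumes Python 3 and a `str` pattern (bytes patterns behave differently); the variable names are illustrative.

```python
import re

# A str pattern compiled in Python 3 carries re.UNICODE even though
# no flag was requested explicitly.
default_pattern = re.compile(r"\w+")
has_unicode_by_default = bool(default_pattern.flags & re.UNICODE)

# Passing re.ASCII opts out, so re.UNICODE is no longer set.
ascii_pattern = re.compile(r"\w+", re.ASCII)
has_unicode_with_ascii = bool(ascii_pattern.flags & re.UNICODE)

print(has_unicode_by_default, has_unicode_with_ascii)  # True False
```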
| Comment by Ian Boros [ 08/May/18 ] | |||||
|
Whenever we get around to this we should probably also fix the fact that the shell doesn't seem to validate the regex options it's given correctly:
| |||||
| Comment by Bernie Hackett [ 07/Nov/17 ] | |||||
|
I agree, but that would be a backward-breaking change for users that store regular expressions rather than just use them in queries. It will also just break a lot of existing applications that are (accidentally?) successfully using language-native regular expressions. I'd prefer to wait until this server ticket is resolved and the BSON spec updated. Then we can decide how and when to make breaking changes to drivers. | |||||
| Comment by Roy Williams [ 07/Nov/17 ] | |||||
|
@behackett Agreed. This could be another ticket, but it seems like the mongo-python-driver should be filtering out invalid flags, or throwing, instead of passing them along to the server only to have them rejected. | |||||
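A minimal sketch of what such filtering could look like on the client side. This is not the driver's actual behavior or API: `regex_options`, `SUPPORTED_OPTIONS`, and the `strict` parameter are hypothetical names, and the supported set is assumed to be the `imxs` set the server documents.

```python
import re

# Assumption: the server accepts only the documented "imxs" options.
SUPPORTED_OPTIONS = "imxs"

# Python re flags and the BSON regex option letters they correspond to.
_FLAG_LETTERS = {
    re.IGNORECASE: "i",
    re.LOCALE: "l",
    re.MULTILINE: "m",
    re.DOTALL: "s",
    re.UNICODE: "u",
    re.VERBOSE: "x",
}


def regex_options(pattern, strict=False):
    """Return the option string for a compiled pattern.

    Unsupported flags are dropped silently, or raise ValueError when
    strict is True. Hypothetical helper, not part of any driver.
    """
    letters = [
        letter for flag, letter in _FLAG_LETTERS.items() if pattern.flags & flag
    ]
    unsupported = [letter for letter in letters if letter not in SUPPORTED_OPTIONS]
    if unsupported and strict:
        raise ValueError("unsupported regex options: " + "".join(sorted(unsupported)))
    return "".join(sorted(letter for letter in letters if letter in SUPPORTED_OPTIONS))


pattern = re.compile("^foo", re.IGNORECASE)
print(regex_options(pattern))  # i
```

With `strict=True` the same pattern raises `ValueError`, because Python 3 adds the `u` flag implicitly. That is exactly the backward-compatibility concern raised in this thread.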
| Comment by Bernie Hackett [ 07/Nov/17 ] | |||||
|
rwilliams-lyft, I suggest not using Python regular expressions with MongoDB. MongoDB uses PCRE for regular expression support. Though Python regular expressions are similar, they are not the same. Sometimes the difference will cause the server to return an error. Sometimes the difference will cause you to get different results than you expected. You are better off using the $regex query operator instead: https://docs.mongodb.com/manual/reference/operator/query/regex/ | |||||
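As a rough illustration of that suggestion, a filter can carry the pattern as a plain string with explicit options, so no Python regex object (and no implicit `u` flag) is involved. The `regex_filter` helper and the `name` field are hypothetical; only the `$regex` and `$options` keys come from the operator documentation linked above.

```python
def regex_filter(field, pattern, options=""):
    """Build a filter document that uses the $regex query operator."""
    spec = {"$regex": pattern}
    if options:
        # Only the options explicitly requested are sent.
        spec["$options"] = options
    return {field: spec}


query = regex_filter("name", "^foo", "i")
print(query)  # {'name': {'$regex': '^foo', '$options': 'i'}}
```

A document built this way would then be passed as the filter argument of a driver query.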
| Comment by Roy Williams [ 07/Nov/17 ] | |||||
|
FWIW we just got bitten by this during our upgrade to Python 3. Python 3 regexes always have the `u` flag set, which is passed through https://github.com/mongodb/mongo-python-driver unmodified. This then led to us doing a full table scan instead of using an index since the index. For example (using mongoengine): ``` will perform the mongo query ``` ); | |||||
| Comment by Bernie Hackett [ 14/Nov/16 ] | |||||
|
According to the design document, $regex and collations don't mix.
That would appear to limit the usefulness of 'u' server side. 'l' seems like a weird option server side. | |||||
| Comment by David Golden [ 11/Nov/16 ] | |||||
|
I suggest someone familiar with collation look into whether 'u' is significant/useful (and whether it's supported by the underlying regex engine). For example, in Perl, you need "u" with "i" for Unicode-aware case-insensitive matching. Also, "u" changes the meaning of "\w" and "\d" to match all Unicode word characters and digits. (Perl also offers an "a" modifier to turn that off so "\d" and "\w" only match ASCII.) If we're going to change the spec on flags, we should do so in a collation-aware way now that we support it. | |||||
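The effect on "\d" can be reproduced in Python, used here only as an analogue: Python's `re.ASCII` flag plays roughly the role of Perl's "a" modifier, and the default `str` behavior corresponds to "u". Whether the server's regex engine behaves the same way is not something this sketch shows.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a decimal digit outside the ASCII range

# Unicode-aware matching (the Python 3 default for str patterns).
unicode_match = re.fullmatch(r"\d", ARABIC_INDIC_THREE) is not None

# ASCII-only matching restricts \d to 0-9.
ascii_match = re.fullmatch(r"\d", ARABIC_INDIC_THREE, re.ASCII) is not None

print(unicode_match, ascii_match)  # True False
```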
| Comment by Bernie Hackett [ 11/Nov/16 ] | |||||
|
The server isn't consistent internally. Here it expects imsx: https://github.com/mongodb/mongo/blob/r3.4.0-rc3/src/mongo/db/matcher/expression_leaf.cpp#L232-L247 And here it expects gim: https://github.com/mongodb/mongo/blob/r3.4.0-rc3/src/mongo/bson/bsonelement.cpp#L245-L263 That last part is really confusing considering |