[SERVER-11947] Add a regex expression to the aggregation language Created: 04/Dec/13  Updated: 19/Jun/19  Resolved: 30/Apr/19

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 4.1.11

Type: New Feature Priority: Major - P3
Reporter: Norberto Fernando Rocha Leite (Inactive) Assignee: Arun Banala
Resolution: Done Votes: 42
Labels: asya, expression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER-11947_fix.js    
Issue Links:
Depends
Documented
is documented by DOCS-12558 Docs for SERVER-39696: Implement $reg... Closed
is documented by DOCS-12669 Docs for SERVER-11947: Add a regex ex... Closed
Duplicate
is duplicated by SERVER-13902 Reverse regex functionality for queries Closed
is duplicated by SERVER-32470 Support for $regex operator in $filte... Closed
is duplicated by SERVER-34122 Pattern Search Support in $filter & $... Closed
is duplicated by SERVER-9159 Use Regex capture groups with project... Closed
is duplicated by SERVER-8892 Use $regex as the expression in a $cond Closed
Related
related to SERVER-39694 Implement $regexMatch as syntactic su... Closed
related to SERVER-9156 Projection by a substring match Closed
related to SERVER-13902 Reverse regex functionality for queries Closed
related to SERVER-22104 $instr function to locate position of... Closed
related to SERVER-8951 Add $findChar or $indexOf operator fo... Closed
related to SERVER-39695 Implement $regexFind Closed
related to SERVER-39696 Implement $regexFindAll Closed
is related to SERVER-36261 Support field projection based on str... Backlog
is related to SERVER-33389 aggregate function similar to regex r... Closed
Backwards Compatibility: Fully Compatible
Participants:
Case:

 Description   
Issue Status as of May 10, 2019

FEATURE DESCRIPTION
This feature adds three new expressions $regexFind, $regexFindAll and $regexMatch to the aggregation language. The $regexFind and $regexFindAll expressions allows regex matching and capturing. $regexMatch is a syntactic sugar on top of $regexFind which can be used for regex matching.

VERSIONS
This feature is available in the 4.1.11 and newer development versions of MongoDB, and in the 4.2 and newer production releases.

RATIONALE
Regex search is a powerful feature of the match language, but does not exist within the aggregation framework. This would unlock many use cases of string manipulation, and bring the two languages closer together. MongoDB Stitch would also be able to leverage this expression to allow users to define visibility rules using regular expressions.

OPERATION

Syntax

Input

{$regexFind:{             // returns the first match found
    input: <expression>,
    regex: <expression>,
    options: <expression> // optional
}}
 
{$regexFindAll:{          // returns every match
    input: <expression>,
    regex: <expression>,
    options: <expression> // optional
}}
 
{$regexMatch:{          // returns true/false
    input: <expression>,
    regex: <expression>,
    options: <expression> // optional
}}

input: string, or expression evaluating to a string
regex: /pattern/opts, or "string pattern", or expression resolving to a regex type. Does not support the extended json regex syntax of {$regex: <string>, $options: <options>}.
options: “imsx”, or expression resolving to a string

Note that this syntax is different from the syntax used to specify regexes and options elsewhere in the server. The $regex match expression may take the form {$regex: <pattern>, $options: <options>}. The important difference is that we are hoisting the ‘regex’ and ‘options’ field into the top-level object. This lets us avoid repeating “regex” twice, (e.g. {input: “x”, regex: {$regex: “xyz”, $options: “123”)}}. Here are some examples:

{$regexFind: {input:"$text", regex: /pattern/opts}
{$regexMatch: {input:"hello world", regex: "$pathToRegexField"}}
{$regexFindAll: {input:"$text", regex: "pattern", options: “mi”}}

options includes all the regex options currently supported in the match language:
'i' - case insensitive
'm' - newlines match ^ and $
'x' - extended mode (allows for comments, ignores whitespace in the regex, etc.)
's' - allows . to include newline characters

Output

$regexFind will return a single document with the format below, for the leftmost substring in input which matches the regex. If no such substring exists, it will return null. $regexFindAll will return an array of documents (one for each substring in input which matches the regex), each of which have the same format as below. If no matches are found, an empty array will be returned.

$regexFind

{
   match: <string>
   captures: [<string>, <string>, ...]
   idx: <non-negative integer>
}

$regexFindAll

[{
   match: <string>
   captures: [<string>, <string>, ...]
   idx: <non-negative integer>
}, ...]

match: the string that the pattern matched.
captures: an array of substrings within the match captured by parenthesis in the regex pattern, ordered by appearance of the parentheses from left to right. This is an empty array if there were no captures.
idx: a zero-based index indicating where the first char of the match appears in the text field being searched. Represents a code point (not a byte offset).

We will also provide an alias for checking whether any substring matches a regex $regexMatch

$regexMatch is sugar for

{$ne: [ {$regexFind: { <arguments> } }, null ] }

This expression won’t be collation aware, so string comparisons implied by the regex will not match the collation (for example if a collection has a case-insensitive collation, the regex will not “automatically” perform a case-insensitive comparison).

Examples

Basic search with captures
Collection

{_id: 0, text:"Simple example"}

Pipeline

db.coll.aggregate([{
    $project: {
        matches: {
            $regexFindAll: {
                input: "$text",
                regex: “(m(p))”,
            }
        }
    }
}])

Output

{
    _id: 0,
    matches: [
        {
            match: "mp",
            captures: ["mp", "p"],
            idx: 2
        },
        {
            match: "mp",
            captures: ["mp", "p"],
            idx: 10
        }
    ]
}

Email extraction
Collection

{_id: 0,  text:"Some field text with email norberto@mongodb.com"}

Pipeline

db.coll.aggregate([{
    $project: {
        match: {
            $regexFind: {
                input: "$text",
                regex: /([a-zA-Z0-9._-]+)@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/
            }
        }
    }
}])

Output

{
    _id: 0,
    match: {
        match: "norberto@mongodb.com",
        captures: ["norberto"],
        idx: 27  
    }
}

No matches ($regexFind)
Collection

{_id: 0,  text: "Some text with no matches"}

Pipeline

db.coll.aggregate([{
    $project: {
        match: {
            $regexFind: {
                input: "$text",
                regex:/not present/
            }
        }
    }
}])

Output

{_id: 0, match: null}

No matches ($regexFindAll)
Collection

{_id: 0,  text: "Some text with no matches"}

Pipeline

db.coll.aggregate([{
    $project: {
        matches: {
            $regexFindAll: {
                input: "$text",
                regex:/not present/
            }
        }
    }
}])

Output

{_id: 0, matches: []}

Using regex stored in the document
Collection

{_id: 0, text: "text with 02 digits", regexField: /[0-9]+/}

Pipeline

db.coll.aggregate([{
    $project: {
        match: {
            $regexFind: {
                input: "$text",
                regex: "$regexField",
            }
        }
    }
}])

Output

{_id: 0, match: {match: "02", captures: [], idx: 10}}

Using $regexMatch in a $cond
Collection

{_id: 0, phoneNumber: "212-456-7890"}
{_id: 1, phoneNumber: "1-800-212-000"}

Pipeline

db.coll.aggregate([{
    $project: {
        region: {
            $cond: {
                if: {
                    $regexMatch: {
                        input: “$phoneNumber”,
                        regex: “^212.*$”,
                    }               
                }
                 then: "New York",
                else: "Somewhere Else"
            }
        }
    }
}])

Output

{_id: 0, region: “New York”}
{_id: 1, region: “Somewhere Else”}

Non-overlapping captures
Input

{_id: 0, text:"aaaaa"}

Pipeline

db.coll.aggregate([{
    $project: {
        matches: {
            $regexFindAll: {
                input: "$text",
                regex: “(a*)”,
            }
        }
    }
}])

Output

{
    _id: 0,
    matches: [
        {
            match: "aaaaa",
            captures: [“aaaaa”],
            idx: 0
        },
    ]
}

The purpose of the above example is to show that after a capture is found the search for the next capture will start at the end of the last one (e.g. instead of returning a capture for “a”, “aa”, “aaa” a single capture for “aaaaa” is returned). This matches the behavior provided by python, javascript and other languages. If other behavior is required, the non-greedy ? operator can be used, e.g. /(a+?)/.



 Comments   
Comment by Arun Banala [ 19/Jun/19 ]

Vokail Yes! This feature is currently available in development version 4.1.11 and will be in production release 4.2. You can find an overview of this feature in the user summary box at the top of this ticket. You can find more detailed examples for $regexFind and $regexFindAll in the upcoming release docs section.

Thank you for your continued interest in this feature!

Comment by Vokail [ 19/Jun/19 ]

I'm not sure to understand: this issue is closed and available in version 4.2 ?

Comment by Asya Kamsky [ 14/Feb/19 ]

brucehappy that edit was made in error - you can disregard it. We are still working on this and while we cannot guarantee that it will make 4.2 we are doing our best.

Comment by Bruce Duncan [ 12/Feb/19 ]

For all of us waiting (for years) for this feature to finally be included as part of 4.2, seeing the fix version change to Q3-2019 is more than a little depressing @Asya

Comment by Asya Kamsky [ 29/Jan/19 ]

Vokail:

Starts with and ends with are both doable currently via one of the available string expressions.

Example for "f1" ends with "f2":

db.endswith.aggregate({$match:{$expr:{$eq:[ "$f2", {$substr:[ "$f1", {$subtract:[{$strLenCP:"$f1"},{$strLenCP:"$f2"}]},{$strLenCP:"$f2"}]}]}}})
{ "_id" : ObjectId("5c50d8cb9f70c4365badd84b"), "f1" : "123xxxxyz", "f2" : "xyz" }
{ "_id" : ObjectId("5c50d8db9f70c4365badd84d"), "f1" : "123xxxxyz1abc", "f2" : "abc" }

Note that $expr is available since 3.6 and allows you to use all of string expressions from aggregation.

Similarly start with should compare contents of "f2" with substring of "f1" from 0 to "length of f2".

Comment by Vokail [ 29/Jan/19 ]

I want also to highlight solution above does not fit with "endswith", because indexOfBytes return only first match

Comment by Vokail [ 29/Jan/19 ]

@Asya Kamsky sure, I've provided all information here: https://stackoverflow.com/questions/54365355/mongodb-regex-in-aggregation-using-reference-to-field-value

 

From my understanding, seems to be the only way is to add a field and use {{indexOfBytes }}

Comment by Asya Kamsky [ 28/Jan/19 ]

Vokail you can already use regular regex find syntax during $match stage. Can you clarify your question/use case?

Comment by Vokail [ 24/Jan/19 ]

I'm also looking for this, in particular:

{$regexFind: ["$text", "$pathToRegexField"]}

 

there is a workaround during aggregation and $match operator to use a regex?

 

 

Comment by Asya Kamsky [ 03/Dec/18 ]

This feature is still scheduled to be worked on for 4.2 release however, it's not guaranteed to be in the 4.2 release until this ticket is closed (i.e. the code is committed).  

Comment by Bruce Duncan [ 03/Dec/18 ]

Could I please get an update on whether this feature is still scheduled for release as part of v4.2?

Comment by Bruce Duncan [ 22/Aug/18 ]

@asya Is this issue still on track for 4.1 with stable release in 4.2?

Comment by Wendong Wu [ 26/Mar/18 ]

@Asya Kamsky I think this is for languages which don't support the /pattern/ syntax for regex, for example Python. With the $regex syntax, user can specify the following string is a regex pattern. Although you may argue that in Python user can use re.compile("pattern") when writing the query.

Comment by Asya Kamsky [ 24/Mar/18 ]

charlie.swanson is there a reason we need to accept both /pattern/ and {$regex:} syntax for the regex expression? wan.bachtiar pointed out that to accept "$" prefixed document in aggregation we would need to make changes to parser which otherwise would reject this as unrecognized agg expression. I couldn't think of any reason to accept $regex subdocument syntax, rather than just document that /pattern/opt syntax should be used, can you?

Comment by Charlie Swanson [ 23/Jun/17 ]

Updated description to match our draft syntax proposal/examples

Comment by Charlie Swanson [ 05/Jun/17 ]

Hi brucehappy,

Sorry for the confusion. I looked at your patch and I believe I now understand your desired change. Unfortunately, we cannot accept this change, as the code you changed is the generic implementation of comparing values within the aggregation framework. This will change more than just the behavior of the $eq operator, in ways that are not consistent with elsewhere in the server. The most obvious place is in the sorting semantics. In both the query and aggregation systems, we must be able to sort values of different types, and we do this by putting all regexes after strings (if you trace the calls, you can eventually see that strings have a canonical type code of 10, whereas regexes have a canonical type code of 50, which means regexes will always come after strings. The code for comparisons within the query system is similar, though implemented in bsonelement.cpp).

// Running against 3.4.4, before your changes.
> db.version()
3.4.4
> db.foo.insert([{_id: 0, x: /test/}, {_id: 1, x: "test"}, {_id: 2, x: "a test sentence"}])
BulkWriteResult({/* ... */})
> db.foo.aggregate([{$sort: {x: 1}}])
{ "_id" : 2, "x" : "a test sentence" }
{ "_id" : 1, "x" : "test" }
{ "_id" : 0, "x" : /test/ }
> db.foo.find().sort({x: 1})
{ "_id" : 2, "x" : "a test sentence" }
{ "_id" : 1, "x" : "test" }
{ "_id" : 0, "x" : /test/ }
 
// Restart server, now with 0.0.0, which is my development version of mongod, with your changes applied.
> db.version()
0.0.0
> db.foo.aggregate([{$sort: {x: 1}}])
{ "_id" : 0, "x" : /test/ }
{ "_id" : 2, "x" : "a test sentence" }
{ "_id" : 1, "x" : "test" }
> db.foo.find().sort({x: 1})
{ "_id" : 2, "x" : "a test sentence" }
{ "_id" : 1, "x" : "test" }
{ "_id" : 0, "x" : /test/ }

Changing the sorting behavior in this way doesn't really make sense. Imagine you have a regex r, and strings a, b, and c. Now imagine that a < b < c, but that r matched both a and c. Where should r go in the sort order?

Further, although it is subtle, the aggregation system's implementation of $eq is actually consistent with the query system's implementation. I understand and sympathize with the confusion, but the match expression {a: /regex/} is actually different than the match expression {a: {$eq: /regex/}}. The {a: /regex/} syntax is really just a shorthand for {a: {$regex: "regex"}}, which I think is pretty confusing:

> db.foo.find({x: /test/})
{ "_id" : 0, "x" : /test/ }
{ "_id" : 1, "x" : "test" }
{ "_id" : 2, "x" : "a test sentence" }
> db.foo.find({x: {$eq: /test/}})
{ "_id" : 0, "x" : /test/ }
> db.foo.find({x: {$regex: "test"}})
{ "_id" : 0, "x" : /test/ }
{ "_id" : 1, "x" : "test" }
{ "_id" : 2, "x" : "a test sentence" }

The aggregation comparison is actually consistent with this later syntax, which is parsed into an equality match expression, rather than a regex match expression. We don't have an equivalent expression within the aggregation framework, which I think is the most obvious way to gain this functionality without fundamentally changing comparisons between values of different types.

With all this context, I think we will need to reject the pull request, and work to provide some sort of $regex expression within the aggregation framework which will give you the functionality you need. We are currently reviewing a draft proposal for this change, which describes the addition of new $regexMatch and $regexSearch expressions. We will provide more details on the desired syntax and behavior once the draft has been approved.

Best,
Charlie

Comment by Bruce Duncan [ 26/May/17 ]

Hello charlie.swanson,
As I indicated in my comment above, my desired use is best described by SERVER-8892, and that was the functionality my PR implements. However, that issue had been marked as a duplicate of this issue and closed, and thus I linked my PR to this one.

What I have implemented is not a new operator, but rather a more modest change to alter how the comparison is performed between a String/Symbol type and a RegEx type in $project and $group. I would argue that my change is actually a bug fix for the very surprising existing behavior of having a regex in a $cond or other comparison not actually execute against the string when inside one of these boolean expressions. I can certainly understand that people would want a new/modified operator to do capturing groups and other regex match stuff in a context outside of a boolean expression, and in fact that would help me with some other features I am working on, but to me, the creation of that new operator (as expressed in this JIRA issue, SERVER-11947) should be considered as a completely different issue than what I've done.

I would suggest reopening SERVER-8892, and consider my PR relative to that issue. I am sure that there is a non-trivial amount of work needed to properly design and implement the new operator(s) needed for SERVER-11947, so I hope that my modest PR could be considered outside that work, and get integrated much more quickly.

Thanks
bruce

Comment by Charlie Swanson [ 26/May/17 ]

Hi brucehappy, Thank you so much for your pull request!

We are always careful to give careful thought to the syntax and semantics when adding a new operator to the aggregation language. We want to make sure it is simple to use, has the right semantics that users will not find surprising, and will be genuinely useful. As part of this process, we need to go through some internal review to agree on what we want the syntax and semantics should be. I've taken over this process and will try to speed this discussion along so we can get to your pull request!

I would like to warn you that the syntax we agree upon is often different than the simplistic use-case pointed out in the original report (SERVER ticket). In this particular case, we'll at least consider adding support for capture groups (SERVER-9159), the index of the match, accepting a regex from elsewhere in the document (SERVER-13902), etc. If it strays too much from your pull request and you are not interested in updating your changes, we can consider accepting what you have implemented and can extend/modify it ourselves. But we'll see what we agree on and I will get back to you!

Thank you, and don't hesitate to ask if you have any questions.
Charlie

Comment by Kelsey Schubert [ 19/May/17 ]

Hi brucehappy,

Thank you for the pull request! We'll review it and provide our comments.

Kind regards,
Thomas

Comment by Bruce Duncan [ 17/May/17 ]

I have submitted a PR to implement this feature here:
https://github.com/mongodb/mongo/pull/1150
I am attaching an example JS that can be run against a mongodb with and without my changes. Without the changes the output looks like this:

{ "_id" : "Misc Series", "count" : 10 }
{ "item" : "a", "series" : "Misc Series" }
{ "item" : "a01", "series" : "Misc Series" }
{ "item" : "a1", "series" : "Misc Series" }
{ "item" : "a9", "series" : "Misc Series" }
{ "item" : "b1", "series" : "Misc Series" }
{ "item" : "c1", "series" : "Misc Series" }
{ "item" : "c1cccc", "series" : "Misc Series" }
{ "item" : "d3", "series" : "Misc Series" }
{ "item" : "e4", "series" : "Misc Series" }
{ "item" : "f5", "series" : "Misc Series" }

And with the changes the output looks like this:

{ "_id" : "A Series", "count" : 4 }
{ "_id" : "B or C Series", "count" : 3 }
{ "_id" : "Misc Series", "count" : 3 }
{ "item" : "a", "series" : "A Series" }
{ "item" : "a01", "series" : "A Series" }
{ "item" : "a1", "series" : "A Series" }
{ "item" : "a9", "series" : "A Series" }
{ "item" : "b1", "series" : "B or C Series" }
{ "item" : "c1", "series" : "B or C Series" }
{ "item" : "c1cccc", "series" : "B or C Series" }
{ "item" : "d3", "series" : "Misc Series" }
{ "item" : "e4", "series" : "Misc Series" }
{ "item" : "f5", "series" : "Misc Series" }

Looking forward to working with backlog-server-query to get this integrated.

Comment by Bruce Duncan [ 16/May/17 ]

I completely agree with wendwu. Having a regular expression as part of the $project or $group stages in a pipeline would be extremely helpful in doing the sort of bucketing that wendwu demonstrates above, which cannot be accomplished once the regular expression used is complex (cannot be implementing using $split, etc).

But it should be noted that the description of this issue is actually requesting something other than regular expression value/string value equality support in the $project and $group stages of the aggregation pipeline. It is requesting support for regular expression match extraction during $project. SERVER-9159 is clearly a duplicate of the original issue. However, SERVER-8892 is a duplicate of what wendwu and I would like to see, which is quite different.

Comment by Wendong Wu [ 02/May/17 ]

asya Thanks for the suggestion. I believe $indexOfCP and $split can solve most of the pattern matching/searching problems. But I think regex is useful in many cases, especially when the patterns get more complicated than splitting the "_". Though we may still find a way to use nested $or, $and expressions to simulate the logic, it is easier to use $regex if we have it supported.

Comment by Asya Kamsky [ 02/May/17 ]

wendwu you can use $split expression to get the same substring here:

db.items.aggregate({$addFields:{baseName:{$arrayElemAt:[{$split:["$name","_"]},0]}}})
{ "_id" : 1, "name" : "novel_1", "qty" : 15, "baseName" : "novel" }
{ "_id" : 2, "name" : "magazine_1", "qty" : 5, "baseName" : "magazine" }
{ "_id" : 3, "name" : "novel_2", "qty" : 5, "baseName" : "novel" }
{ "_id" : 4, "name" : "guitar_1", "qty" : 10, "baseName" : "guitar" }
{ "_id" : 5, "name" : "violin_1", "qty" : 10, "baseName" : "violin" }

Comment by Wendong Wu [ 02/May/17 ]

This feature is very useful in preprocessing the data and then send it to the $group pipeline afterwards. For example for a collection like this,

{ _id: 1, name: "novel_1", qty: 15}
{ _id: 2, name: "magazine_1", qty: 5}
{ _id: 3, name: "novel_2", qty: 5}
{ _id: 4, name: "guitar_1", qty: 10}
{ _id: 5, name: "violin_1", qty: 10}

We can have the aggregation pipeline go like this to get the categorize the data and calculate the count for each category.

db.items.aggregate([
    {$project: {category: {
        $switch: {
            branches: [
                // use regex here to categorize the items by their name
                {case: {$in: ['$name', [/magazine/, /novel/]]},
                    then: 'book'},
                {case: {$in: ['$name', [/guitar/, /violin/]]},
                    then: 'instrument'}
            ],
            default: 'others'
        }
    }}},
    // get the group-by count based on the category
    {$group: {
        _id: {category: '$category'},
        count: {$sum: '$qty'}
    }}
]);

Generated at Thu Feb 08 03:27:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.