Core Server / SERVER-11947

Add a regex expression to the aggregation language

      Description

      Issue Status as of May 10, 2019

      FEATURE DESCRIPTION
      This feature adds three new expressions, $regexFind, $regexFindAll, and $regexMatch, to the aggregation language. The $regexFind and $regexFindAll expressions allow regex matching and capturing. $regexMatch is syntactic sugar on top of $regexFind for cases that only need to know whether the input matches.

      VERSIONS
      This feature is available in the 4.1.11 and newer development versions of MongoDB, and in the 4.2 and newer production releases.

      RATIONALE
      Regex search is a powerful feature of the match language, but does not exist within the aggregation framework. Adding it would unlock many string manipulation use cases and bring the two languages closer together. MongoDB Stitch would also be able to use these expressions to let users define visibility rules with regular expressions.

      OPERATION

      Syntax

      Input

      {$regexFind:{             // returns the first match found
          input: <expression>,
          regex: <expression>,
          options: <expression> // optional
      }}
       
      {$regexFindAll:{          // returns every match
          input: <expression>,
          regex: <expression>,
          options: <expression> // optional
      }}
       
      {$regexMatch:{          // returns true/false
          input: <expression>,
          regex: <expression>,
          options: <expression> // optional
      }}
      
      

      input: string, or expression evaluating to a string
      regex: /pattern/opts, or "string pattern", or expression resolving to a regex type. Does not support the Extended JSON regex syntax of {$regex: <string>, $options: <options>}.
      options: "imsx" (any combination of these characters), or expression resolving to a string

      Note that this syntax differs from the syntax used to specify regexes and options elsewhere in the server. The $regex match expression may take the form {$regex: <pattern>, $options: <options>}. The important difference is that the 'regex' and 'options' fields are hoisted into the top-level object, which avoids repeating "regex" twice (e.g. {input: "x", regex: {$regex: "xyz", $options: "123"}}). Here are some examples:

      {$regexFind: {input: "$text", regex: /pattern/opts}}
      {$regexMatch: {input: "hello world", regex: "$pathToRegexField"}}
      {$regexFindAll: {input: "$text", regex: "pattern", options: "mi"}}
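
The two literal forms above (options inline in a regex literal, or a string pattern with a separate options field) can be illustrated outside the server. The sketch below is plain JavaScript, not server code: `toRegExp` is a hypothetical helper that folds a `{regex, options}` pair into one JavaScript RegExp. Rejecting options supplied in both places is an assumption of this sketch, not something this description specifies, and JavaScript's RegExp has no "x" flag, so only "i", "m" and "s" are usable here.

```javascript
// Hypothetical helper: normalize the hoisted {regex, options} arguments
// into a single JavaScript RegExp. Illustration only, not server code.
function toRegExp({ regex, options }) {
  const isLiteral = regex instanceof RegExp;
  const pattern = isLiteral ? regex.source : String(regex);
  const inlineFlags = isLiteral ? regex.flags : "";
  // Assumption of this sketch: flags may come from one place only.
  if (inlineFlags && options) {
    throw new Error("options given both inline and in the options field");
  }
  return new RegExp(pattern, inlineFlags || options || "");
}

// Regex literal carrying its own options.
const fromLiteral = toRegExp({ regex: /pattern/i });
// String pattern with options hoisted into a sibling field.
const fromString = toRegExp({ regex: "pattern", options: "mi" });

console.log(fromLiteral.source, fromLiteral.flags);
console.log(fromString.source, fromString.flags);
```

Both calls produce a regex for the same pattern; only the place where the options are written differs, which is the point of hoisting them.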
      

      options includes all the regex options currently supported in the match language:
      'i' - case insensitive
      'm' - multiline: ^ and $ match at the start and end of each line
      'x' - extended mode (allows for comments, ignores whitespace in the regex, etc.)
      's' - allows . to match newline characters
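
To make the effect of each option concrete, here is a small sketch using JavaScript's built-in RegExp, which happens to use the same letters for 'i', 'm' and 's'. It is only an analogy: the server uses its own regex engine, and JavaScript has no 'x' flag, so that option is not shown. The sample text and the `checks` table are made up for illustration.

```javascript
// Each entry pairs the result without the flag and with the flag.
const text = "first line\nSecond line";

const checks = {
  // 'i': "second" only matches "Second" when case is ignored.
  i: [/second/.test(text), /second/i.test(text)],
  // 'm': ^ matches at the start of the second line only in multiline mode.
  m: [/^Second/.test(text), /^Second/m.test(text)],
  // 's': . matches the newline between the two lines only in dot-all mode.
  s: [/line.Second/.test(text), /line.Second/s.test(text)],
};

for (const [flag, [withoutFlag, withFlag]] of Object.entries(checks)) {
  console.log(flag, withoutFlag, withFlag);
}
```

In every row the unflagged regex fails and the flagged one succeeds, so each option is shown in isolation.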

      Output

      $regexFind will return a single document with the format below, for the leftmost substring in input which matches the regex. If no such substring exists, it will return null. $regexFindAll will return an array of documents (one for each substring in input which matches the regex), each of which has the same format as below. If no matches are found, it will return an empty array.

      $regexFind

      {
         match: <string>,
         captures: [<string>, <string>, ...],
         idx: <non-negative integer>
      }
      

      $regexFindAll

      [{
         match: <string>,
         captures: [<string>, <string>, ...],
         idx: <non-negative integer>
      }, ...]
      

      match: the string that the pattern matched.
      captures: an array of substrings within the match captured by parentheses in the regex pattern, ordered by appearance of the opening parentheses from left to right. This is an empty array if there were no captures.
      idx: a zero-based index indicating where the first character of the match appears in the text field being searched. It is measured in code points, not bytes.
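
The output shape described above can be approximated in plain JavaScript, which may help when reasoning about what a pipeline will return. This is a sketch under stated assumptions, not the server implementation: `regexFind`, `regexFindAll`, `toResult` and `codePointIndex` are hypothetical helper names, the matching is done by JavaScript's RegExp rather than the server's engine, and reporting a non-participating capture group as null is an assumption. Patterns that can match the empty string may behave differently on the server.

```javascript
// Count code points (not UTF-16 code units) before a given string offset.
function codePointIndex(input, utf16Offset) {
  return Array.from(input.slice(0, utf16Offset)).length;
}

// Build a result document in the {match, captures, idx} shape.
function toResult(input, m) {
  return {
    match: m[0],
    // Assumption: a group that did not participate is reported as null.
    captures: m.slice(1).map((c) => (c === undefined ? null : c)),
    idx: codePointIndex(input, m.index),
  };
}

// Approximation of $regexFind: leftmost match, or null when nothing matches.
function regexFind(input, pattern, options = "") {
  const m = new RegExp(pattern, options).exec(input);
  return m === null ? null : toResult(input, m);
}

// Approximation of $regexFindAll: every non-overlapping match, possibly none.
function regexFindAll(input, pattern, options = "") {
  const re = new RegExp(pattern, options + "g");
  return Array.from(input.matchAll(re), (m) => toResult(input, m));
}

console.log(regexFindAll("Simple example", "(m(p))"));
console.log(regexFind("Some text with no matches", "not present"));
// U+1F600 is one code point but two UTF-16 code units, so idx is 2 here, not 3.
console.log(regexFind("\u{1F600} mp", "mp").idx);
```

The last call shows why idx is defined in code points: a naive string offset would be off by one for every character outside the Basic Multilingual Plane that precedes the match.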

      We will also provide an alias, $regexMatch, for checking whether any substring matches a regex.

      $regexMatch is sugar for:

      {$ne: [ {$regexFind: { <arguments> } }, null ] }
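
The same relationship can be written out in plain JavaScript as a rough illustration. In this sketch `find` is a hypothetical stand-in for $regexFind reduced to the only property the sugar depends on (a result, or null when nothing matches), and `regexMatch` mirrors the $ne comparison above. It uses JavaScript's RegExp, not the server's engine.

```javascript
// Stand-in for $regexFind: returns a match array, or null when nothing matches.
const find = ({ input, regex }) => regex.exec(input);

// Mirrors {$ne: [{$regexFind: <arguments>}, null]}.
const regexMatch = (args) => find(args) !== null;

console.log(regexMatch({ input: "212-456-7890", regex: /^212.*$/ }));
console.log(regexMatch({ input: "1-800-212-000", regex: /^212.*$/ }));
```

Because the result is a plain boolean, it can be used directly wherever a condition is expected.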
      

      These expressions are not collation aware, so string comparisons implied by the regex will not follow the collation (for example, if a collection has a case-insensitive collation, the regex will not "automatically" perform a case-insensitive comparison).

      Examples

      Basic search with captures
      Collection

      {_id: 0, text:"Simple example"}
      
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFindAll: {
                      input: "$text",
                      regex: "(m(p))",
                  }
              }
          }
      }])
      

      Output

      {
          _id: 0,
          matches: [
              {
                  match: "mp",
                  captures: ["mp", "p"],
                  idx: 2
              },
              {
                  match: "mp",
                  captures: ["mp", "p"],
                  idx: 10
              }
          ]
      }
      

      Email extraction
      Collection

      {_id: 0,  text:"Some field text with email norberto@mongodb.com"}
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              match: {
                  $regexFind: {
                      input: "$text",
                      regex: /([a-zA-Z0-9._-]+)@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/
                  }
              }
          }
      }])
      

      Output

      {
          _id: 0,
          match: {
              match: "norberto@mongodb.com",
              captures: ["norberto"],
              idx: 27  
          }
      }
      

      No matches ($regexFind)
      Collection

      {_id: 0,  text: "Some text with no matches"}
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              match: {
                  $regexFind: {
                      input: "$text",
                      regex: /not present/
                  }
              }
          }
      }])
      

      Output

      {_id: 0, match: null}
      

      No matches ($regexFindAll)
      Collection

      {_id: 0,  text: "Some text with no matches"}
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFindAll: {
                      input: "$text",
                      regex: /not present/
                  }
              }
          }
      }])
      

      Output

      {_id: 0, matches: []}
      

      Using regex stored in the document
      Collection

      {_id: 0, text: "text with 02 digits", regexField: /[0-9]+/}
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              match: {
                  $regexFind: {
                      input: "$text",
                      regex: "$regexField",
                  }
              }
          }
      }])
      

      Output

      {_id: 0, match: {match: "02", captures: [], idx: 10}}
      

      Using $regexMatch in a $cond
      Collection

      {_id: 0, phoneNumber: "212-456-7890"}
      {_id: 1, phoneNumber: "1-800-212-000"}
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              region: {
                  $cond: {
                      if: {
                          $regexMatch: {
                              input: "$phoneNumber",
                              regex: "^212.*$",
                          }
                      },
                      then: "New York",
                      else: "Somewhere Else"
                  }
              }
          }
      }])
      

      Output

      {_id: 0, region: "New York"}
      {_id: 1, region: "Somewhere Else"}
      

      Non-overlapping captures
      Collection

      {_id: 0, text:"aaaaa"}
      

      Pipeline

      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFindAll: {
                      input: "$text",
                      regex: "(a*)",
                  }
              }
          }
      }])
      

      Output

      {
          _id: 0,
          matches: [
              {
                  match: "aaaaa",
                  captures: ["aaaaa"],
                  idx: 0
              },
          ]
      }
      

      The purpose of the above example is to show that after a match is found, the search for the next match starts at the end of the previous one (e.g. instead of returning separate matches for "a", "aa", "aaa", a single match for "aaaaa" is returned). This matches the behavior provided by Python, JavaScript and other languages. If other behavior is required, the non-greedy ? operator can be used, e.g. /(a+?)/.
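
The difference between the greedy and non-greedy forms can be seen with JavaScript's own regex engine, one of the languages named above. This is a sketch for comparison only: `allMatches` is a hypothetical helper, and the patterns use a+ rather than a* to keep empty matches, which engines do not all handle the same way, out of the picture.

```javascript
// Collect every non-overlapping match as {match, idx}. With the "g" flag,
// matchAll resumes scanning where the previous match ended.
function allMatches(input, regex) {
  return Array.from(input.matchAll(regex), (m) => ({ match: m[0], idx: m.index }));
}

// Greedy: a+ consumes all five characters, so nothing is left for a second match.
const greedy = allMatches("aaaaa", /(a+)/g);
// Non-greedy: a+? stops after one character, so the scan yields five matches.
const lazy = allMatches("aaaaa", /(a+?)/g);

console.log(greedy);
console.log(lazy.map((r) => r.idx));
```

The greedy pattern yields a single match covering the whole input, while the non-greedy pattern yields one match per character at consecutive indexes.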
