Uploaded image for project: 'Documentation'
  1. Documentation
  2. DOCS-12669

Docs for SERVER-11947: Add a regex expression to the aggregation language

      Description

      Description:

      No documentation summary in engineering ticket

      Engineering Ticket Description:

      Motivation

      Regex search is a powerful feature of the match language, but does not exist within the aggregation framework. This would unlock many use cases of string manipulation, and bring the two languages closer together.

      Syntax Proposal (subject to change)

      Any of the following:

      {$regexFind: ["$text", /pattern/opts]}
      {$regexFind: ["$text", {$regex: /pattern/opts}}]}
      {$regexFind: ["$text", {$regex: "pattern", $options: "opts"}}]}
      {$regexFind: ["$text", "$pathToRegexField"]}
      

      Examples

      Basic search with captures
      Collection Pipeline Output
      {_id: 0, text:"Simple example"}
      
      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFind: [
                      "$text",
                      /(m(p))/g
                  ]
              }
          }
      }])
      
      {
          _id: 0,
          matches: [
              {
                  match: "mp",
                  captures: [mp, p],
                  idx: 5
              },
              {
                  match: "mp",
                  captures: [mp, p],
                  idx: 12
              }
          ]
      }
      
      Email extraction
      Collection Pipeline Output
      {
        _id: 0,
        text: "Some field text with email test@mongodb.com"
      }
      
      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFind: [
                      "$text",
                      /([a-zA-Z0-9._-]+)@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/
                  ]
              }
          }
      }])
      
      {
          _id: 0,
          matches: [
              {
                  match: "test@mongodb.com",
                  captures: ["test"],
                  idx: 27  
              }
          ]
      }
      
      No matches
      Collection Pipeline Output
      {_id: 0,  text: "Some text with no matches"}
      
      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFind: [
                      "$text",
                      /not present/
                  ]
              }
          }
      }])
      
      {_id: 0, matches: []}
      
      Using regex stored in the document
      Collection Pipeline Output
      {
        _id: 0,
        text: "text with 02 digits",
        regexField: /[0-9]+/
      }
      
      db.coll.aggregate([{
          $project: {
              matches: {
                  $regexFind: [
                      "$text",
                      "$regexField",
                  ]
              }
          }
      }])
      
      {
        _id: 0,
        matches: [
          {match: "02", captures: [], idx: 10}
        ]
      }
      
      Using $regexFind to perform a test inside a $cond
      Collection Pipeline Output
      {
        _id: 0,
        text: "special clearance",
        privateInfo: "secret info"
      }
      {
        _id: 1,
        text: "locked down",
        privateInfo: "secret"
      }
      
      db.coll.aggregate([{
          $project: {
              privateInfo: {
                  $cond: [
                      {$eq: [
                          {$size: {$regexFind: ["$text", /clearance/]}},
                          0
                      ]},
                      "REDACTED",
                      "$privateInfo"
                  ]
              }
          }
      }])
      
      {_id: 0, privateInfo: "secret info"}
      {_id: 1, privateInfo: "REDACTED"}
      

      Notes

      • Naming: There is some inconsistency across languages/databases about whether a regex 'match' should consume the entire string or search throughout the string (think /^pattern$/ instead of /pattern/) so we tried to avoid the term 'search' or 'match'. Notes from a comment chain earlier:
        • It looks like there isn't really consensus amongst programming languages or database vendors about what regex 'search' means relative to regex 'match'. For example, java, and go have a match() or matches() which is a full/exact match and a find() which searches. Python has search() instead of find, with the same distinction. But .NET and javascript both use match() to mean 'find anywhere in the string'.
          In the database world, SQL has a 'LIKE' operator that needs to be an exact match, and users must preffix/suffix with the '%' character to match the pattern anywhere in the string. However, MySQL has it's own REGEXP operator that will look for partial matches within the string.
          With all this context, how about we go with $regexFind, since that is never overloaded to anything except "find anywhere within the string"? I don't think we need to add a version that's an exact match, but this would allow us to in the future without stepping on our feet.
      • Each result document contains a match, captures and idx field. Default is to only match the first result, but if the global flag is specified ('g'), all matches will be returned.
        • match: the string that the pattern matched.
        • captures: an array of substrings within the match captured by parenthesis in the regex pattern, ordered by appearance of the parentheses from left to right, limit 10. This is an empty array if there were no captures.
        • idx: where the first char of the match appears in the text field being searched.
      • Options include all the supported query opts:
        • Existing
          • 'i' - case insensitive
          • 'm' - newlines match ^ and $
          • 'x' - extended mode (allows for comments, ignores whitespace in the regex, etc.)
          • 's' - allows . to include newline characters
        • New
          • 'g' - will match all occurrences of the pattern. This is not part of PCRE, and would be parsed by us. Alternatively, we could take an additional argument, or make a different operator.
      • You can currently search for an exact string match using $indexOf(CP|Bytes):
        > db.bar.insert([{_id: 0, text: "has sentinel"}, {_id: 1, text: "no match"}])
        BulkWriteResult({...})
        > db.bar.aggregate([{$project: {hasSentinel: {$indexOfCP: ["$text", "sentinel"]}}}])
        { "_id" : 0, "hasSentinel" : 4 }
        { "_id" : 1, "hasSentinel" : -1 }
        

      Errors

      • If text option does not resolve to a string (will error if given symbol).
        • One exception (for consistency with other expressions): If text option is nullish, returns null.
      • If regex option does not resolve to a regex type or an object with a $regex field.
        • One exception (for consistency with other expressions): If regex option is nullish, returns null.
      • If regex option is an object with a $regex field, but has another field specified that is not $options.
      • If given a regex option other than ('i', 'm', 'x', 's', 'g')
        • This is different than the query operator, which silently ignores invalid options:
          > db.bar.find()
          { "_id" : 0, "text" : "has sentinel" }
          { "_id" : 1, "text" : "no match" }
          > db.bar.find({text: {$regex: /sentinel/, $options: "abc"}})
          { "_id" : 0, "text" : "has sentinel" }
          
      • If given both /re/opts style options, and $options
        • This is different than the query operator, which will prefer $options:
          > db.baz.find()
          { "_id" : 0, "text" : "text\nwith\nNewLines" }
          { "_id" : 1, "text" : "newlines\nwith\ntext" }
          { "_id" : 2, "text" : "NewLines\nwith\ntext" }
          { "_id" : 4, "text" : "text\nwith\nnewlines" }
          > db.baz.find({text: {$regex: /^newlines/m, $options: "i"}})
          { "_id" : 1, "text" : "newlines\nwith\ntext" }
          { "_id" : 2, "text" : "NewLines\nwith\ntext" }
          
      • If given a malformed regex (e.g. {$regex: "[0-9"})
        • This is different than the query operator, which silently ignores a malformed regex.
          > db.baz.find()
          { "_id" : 0, "text" : "text with 02 digits" }
          { "_id" : 1, "text" : "[0-9" }
          > db.baz.find({text: {$regex: "[0-9"}})
          
      Original Description

      would be great to be able to have a projection that would filter data out of a regular expression:

      db.example.insert({_id: 1,  text:"Some field text with email norberto@mongodb.com    "   })
      db.example.aggregate(  {$project: {  extracted: {  $regexmatch: [ "$text", '[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+' ]       }   }  } 
      

      where the output could be

      {_id: 1,
       extracted: ["norberto@mongodb.com"]
      }
      
      

      Scope of changes

      Impact to Other Docs

      MVP (Work and Date)

      Resources (Scope or Design Docs, Invision, etc.)

            Assignee:
            Unassigned Unassigned
            Reporter:
            kay.kim@mongodb.com Kay Kim (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved:
              4 years, 45 weeks, 2 days ago