[SERVER-8951] Add $findChar or $indexOf operator for strings to find position of specific character (or substring) Created: 12/Mar/13  Updated: 22/Mar/17  Resolved: 29/Apr/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 3.3.6

Type: Improvement Priority: Major - P3
Reporter: Asya Kamsky Assignee: Benjamin Murphy
Resolution: Done Votes: 16
Labels: expression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by CSHARP-1661 Add $findChar or $indexOf operator fo... Closed
Documented
is documented by DOCS-8529 Create 3.4 manual page for $indexOfXXX Closed
Duplicate
is duplicated by SERVER-22104 $instr function to locate position of... Closed
is duplicated by SERVER-23943 $substr to locate occurrence of a "pa... Closed
Related
is related to DRIVERS-297 Aggregation Framework Support for 3.4 Closed
is related to SERVER-9156 Projection by a substring match Closed
is related to SERVER-11947 Add a regex expression to the aggrega... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 12 (04/04/16), Query 13 (04/22/16), Query 14 (05/13/16)
Participants:

 Description   

Syntax

{$indexOfBytes: [<string>, <search value>, <start index - optional>, <end index - optional>]}
{$indexOfCP: [<string>, <search value>, <start index - optional>, <end index - optional>]}
{$indexOfArray: [<array>, <search value>, <start index - optional>, <end index - optional>]}

Examples

> db.coll.insert([
  {_id: 1, string: "hello world"}
]);
> db.coll.aggregate([{
  $project: {
    location: {$indexOfBytes: ["$string", "world"]}
  }
}]);
{_id: 1, location: 6}
 
// Example 2 - differentiating code points vs. bytes.
> db.coll.insert([
  {_id: 1, string: "øle"}
]);
> db.coll.aggregate([{
  $project: {
    byteLocation: {$indexOfBytes: ["$string", "le"]},
    cpLocation: {$indexOfCP: ["$string", "le"]}
  }
}]);
{_id: 1, byteLocation: 2, cpLocation: 1}
 
// Example 3 - using the start index.
> db.coll.insert([
  {_id: 1, string: "PREFIX|text with word FIX"},  // Contains "FIX", Should match.
  {_id: 2, string: "PREFIX|text without target"}  // Should not match.
]);
> db.coll.aggregate([{
  $project: {
    containsFix: {$indexOfCP: ["$string", "fix", {$strLenCP: "PREFIX|"}]}
  }
}]);
{_id: 1, containsFix: 22}
{_id: 2, containsFix: -1}
 
 
// Example 4 - using the start and end indices.
> db.coll.insert([
  {_id: 1, string: "PREFIX|text with word FIX|SUFFIX"},  // Contains "FIX", Should match.
  {_id: 2, string: "PREFIX|text without target|SUFFIX"}  // Should not match.
]);
> db.coll.aggregate([{
  $project: {
    containsFix: {
      $let: {
        vars: {
          startIndex: {$strLenCP: "PREFIX|"},  // 7
          endIndex: {$subtract: [0, {$strLenCP: "|SUFFIX"}]}  // -7
        },
        in: {$indexOfCP: ["$string", "fix", "$$startIndex", "$$endIndex"]}
      }
    }
  }
}]);
{_id: 1, containsFix: 22}
{_id: 2, containsFix: -1}

Notes

  • Same functionality as Python's find(). Returns -1 if there were no occurrences.

Errors

  • For indexOfBytes/indexOfCP, if the first two arguments are not strings. For indexOfArray, if the first argument is not an array.
  • If either of the last two arguments are not integral.

Old Description
Would like to have some operator to find character (or substring) in string. Use case - normalizing strings like server names "server1.foo.bar" to short names via $substr operator (without knowing location of first '.')



 Comments   
Comment by Benjamin Murphy [ 29/Apr/16 ]

This ticket introduces $indexOfArray, $indexOfBytes, and $indexOfCP, with syntax in the description. It will need to be documented, and any drivers that support aggregation helpers will need to be updated to support it.

Comment by Githook User [ 29/Apr/16 ]

Author:

{u'username': u'benjaminmurphy', u'name': u'Benjamin Murphy', u'email': u'benjamin_murphy@me.com'}

Message: SERVER-8951 Aggregation now supports the indexOfArray, indexOfBytes, and indexOfCP expressions.
Branch: master
https://github.com/mongodb/mongo/commit/7ae631410d8ffe71c74f96d5ab5dd408764b7858

Comment by Asya Kamsky [ 19/Apr/16 ]

This ticket is for expression for aggregation framework. $regex already exists and works in queries. If there is new functionality that's needed for queries that can't be done with $regex please open a new SERVER ticket with description of the use case/example of query that would be needed.

Comment by Shakir Sadikali [ 19/Apr/16 ]

I'd like this added to more than the aggregation framework. In the general case, 2 functions

  1. returns the position of "substring" within "string"
  2. returns the "substring" within "string" starting at position A and ending at position B

Have this work both in the aggregation framework and in "normal" queries (optionally support regex-like syntax?)

This would be ideal, at least in the use-cases I've run into.

Comment by Andrew Shevchuk [ 24/Feb/16 ]

Would be nice to have a string occurrence operator(less expensive than regex), like this:
db.coll.find( { fullName :

{ $substr : "server1" }

} ).

Comment by Charlie Swanson [ 15/Jan/16 ]

SERVER-22104 points out that it might also be nice to include options to start the search at an index other than 0, and to search for the ith occurrence of the string.

I'll also note that (like other string manipulations) this gets tricky when dealing with UTF-8 characters.

Comment by Asya Kamsky [ 25/Sep/15 ]

There is a workaround described here:

http://www.kamsky.org/stupid-tricks-with-mongodb/ugly-way-to-parse-a-string-with-aggregation-framework

Comment by John A. De Goes [ 16/Jun/15 ]

This would be very helpful to me. Although even more helpful is a way to convert a string into an array of characters (and an array of characters into a string), where of course a character is modeled as a length 1 string.

Then the new array operators for mapping, filtering, and reducing could be used to perform string manipulation, too.

The alternative is to have parallel versions of operators for strings and arrays. In general, you want to do the same sort of things on both (finding things, subsetting, filtering, etc.).

Comment by Yuri Finkelstein [ 03/Nov/13 ]

I run into the same limitation with almost identical use case - need to extract the first token from a string where token boundaries are "." - whatever the symbol is does not matter. I agree with Scott that this class of problems requires a generic solution allowing for more powerful string manipulation in $project (and $group !). Perhaps - embed AWK or something like this
Jokes aside - something needs to be done as today the operators are very limited.

Comment by Asya Kamsky [ 16/Mar/13 ]

The thing I was specifically trying to do was shorten a string into first "part" of it.

I had server names as FQDNS server1.foo.bar.com server2.foo.bar.com, cfg1.foo.bar.com and I wanted to $project them to be just machine name (server1, server2, cfg1) - we have $substr for getting a substring but there is no way to find out where the first delimeter ('.' in this case) is. If there was an operator which returned the first occurrence of a character or a string in another string existed (say it was called $index or $find) I could
project:{shortName: {$substr:["$fullName",0,{$index:["$fullName","."]} ]} } (this is assuming position returned is zero based, which I would hope it would be)

Comment by Scott Hernandez (Inactive) [ 16/Mar/13 ]

What we need is an expression language, including string manipulation, which can execute functions and apply transformations/logic for projections and in other places.

Can you provide a use case? "$index" doesn't make a lot of sense to me.

Comment by Asya Kamsky [ 16/Mar/13 ]

To clarify, I mean for $project phase. Maybe $index is a better name (less overloading of meaning)

Generated at Thu Feb 08 03:18:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.