[SERVER-22580] Add $cpLength and $cpSubstr expressions which work via code points Created: 11/Feb/16  Updated: 14/Mar/17  Resolved: 25/Mar/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 3.3.4

Type: New Feature Priority: Major - P3
Reporter: Charlie Swanson Assignee: Benjamin Murphy
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by CSHARP-1622 Add $cpLength and $cpSubstr expressio... Closed
Documented
is documented by DOCS-8550 Document $substrBytes/$substrCP Closed
Related
related to DRIVERS-297 Aggregation Framework Support for 3.4 Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 12 (04/04/16)
Participants:
Linked BF Score: 0

 Description   

Syntax

{$substrBytes: [ <string>, <expression>, <expression>] }
{$substrCP: [ <string>, <expression>, <expression>] }

Examples

Input

{_id: 0, string: "ελληνικά"}

Pipeline

db.coll.aggregate([{
    $project: {
        byteSubstr: {$substrBytes: ["$string", 0, 4]},
        cpSubstr: {$substrCP: ["$string", 0, 4]}
    }
}])

Output

{_id: 0, byteSubstr: "ελ", cpSubstr: "ελλη"}
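
The difference between the two outputs can be reproduced outside the server. The following is a minimal sketch, assuming a Node.js runtime; the helper names substrBytes and substrCP are hypothetical stand-ins that mimic the semantics described above, not the server implementation. Each Greek letter in the input occupies two UTF-8 bytes, so four bytes cover two letters while four code points cover four letters.

```javascript
// Hypothetical helpers that mimic the semantics described in this ticket.
// They are illustrative only and are not the server implementation.

// Take `count` UTF-8 bytes starting at byte offset `start`.
function substrBytes(str, start, count) {
    const bytes = Buffer.from(str, "utf8");
    return bytes.subarray(start, start + count).toString("utf8");
}

// Take `count` code points starting at code point index `start`.
function substrCP(str, start, count) {
    // Array.from iterates a string by code point, not by UTF-16 code unit.
    return Array.from(str).slice(start, start + count).join("");
}

const s = "ελληνικά";
console.log(substrBytes(s, 0, 4)); // two letters (4 bytes)
console.log(substrCP(s, 0, 4));    // four letters (4 code points)
```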

Additional Notes

  • Will not add any new query functionality to work with strings.
  • $substrBytes will error if it starts or ends in the middle of a code point.
  • $substrCP will error on any input that is detected to be invalid UTF-8.
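
The $substrBytes boundary rule in the notes above can be sketched as a check for UTF-8 continuation bytes (bytes of the form 10xxxxxx). This is only an illustration, assuming a Node.js runtime and input that is already valid UTF-8; substrBytesStrict and isContinuationByte are hypothetical names, and the server's actual error handling may differ.

```javascript
// Hypothetical sketch of the boundary rule; not the server implementation.
// Assumes the input string encodes to valid UTF-8.

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
function isContinuationByte(b) {
    return (b & 0xC0) === 0x80;
}

function substrBytesStrict(str, start, count) {
    const bytes = Buffer.from(str, "utf8");
    const end = Math.min(start + count, bytes.length);
    // A slice that begins on a continuation byte starts mid code point.
    if (start < bytes.length && isContinuationByte(bytes[start])) {
        throw new Error("substring starts in the middle of a code point");
    }
    // If the first byte after the slice is a continuation byte,
    // the slice ends mid code point.
    if (end < bytes.length && isContinuationByte(bytes[end])) {
        throw new Error("substring ends in the middle of a code point");
    }
    return bytes.subarray(start, end).toString("utf8");
}
```

With the example input from this ticket, substrBytesStrict("ελληνικά", 0, 4) returns the first two letters, while a byte count of 3 or a start offset of 1 throws because the slice would split a two-byte letter.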

Original Description

The current expression $substr and the proposed expression $length (see SERVER-14670) work in terms of bytes in the string. Sometimes it is desirable to work in terms of code points instead, so we should add equivalent expressions that work with code points.

For example, {$substr: ["\uD834\uDF06", 0, 1]} would be an error (since the second byte is a continuation byte), but {$cpSubstr: ["\uD834\uDF06", 0, 1]} would be "\uD834\uDF06".

Correspondingly, {$length: "\uD834\uDF06"} would be 2, but {$cpLength: "\uD834\uDF06"} would be 1.



 Comments   
Comment by Githook User [ 25/Mar/16 ]

Author:

{u'username': u'benjaminmurphy', u'name': u'Benjamin Murphy', u'email': u'benjamin_murphy@me.com'}

Message: SERVER-22580 Remove invalid UTF-8 from log messages
Branch: master
https://github.com/mongodb/mongo/commit/6dc3a9bdf1f4049c15761a912348f4c34404f88f

Comment by Benjamin Murphy [ 25/Mar/16 ]

This ticket introduced the $substrCP expression, with syntax as described in the description. It needs documentation, and any drivers that provide aggregation framework helpers should be updated to include this new expression.

In addition, this patch deprecated $substr in favor of $substrBytes, which has identical functionality.

Comment by Githook User [ 25/Mar/16 ]

Author:

{u'username': u'benjaminmurphy', u'name': u'Benjamin Murphy', u'email': u'benjamin_murphy@me.com'}

Message: SERVER-22580 Aggregation now supports substrCP.
Branch: master
https://github.com/mongodb/mongo/commit/5afa97da4ce5049ef7eb8bf4717ce37bd6777754

Comment by Charlie Swanson [ 11/Mar/16 ]

After some internal discussion, I've updated the description to match the agreed-upon design.

Comment by Mathias Stearn [ 16/Feb/16 ]

Your example is a bit confusing because it uses UTF-16 surrogate pairs. It may be simpler to rewrite it using either the UTF-8 bytes or UTF-32 whole code points. I think your $length example would actually return 4, since that is the number of UTF-8 bytes it takes to represent that single non-BMP code point (see http://www.charbase.com/1d306-unicode-tetragram-for-centre).
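
The three ways of counting mentioned in this comment can be compared directly. This is a small sketch, assuming a Node.js runtime; the variable names are illustrative only. It measures the same string, U+1D306, in UTF-16 code units, code points, and UTF-8 bytes.

```javascript
// U+1D306 written as a UTF-16 surrogate pair, as in the original description.
const tetragram = "\uD834\uDF06";

// JavaScript's .length counts UTF-16 code units.
const utf16Units = tetragram.length;

// Array.from splits by code point, so a surrogate pair counts once.
const codePoints = Array.from(tetragram).length;

// Buffer.byteLength reports the size of the UTF-8 encoding.
const utf8Bytes = Buffer.byteLength(tetragram, "utf8");

console.log({ utf16Units, codePoints, utf8Bytes });
// → { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```

A byte-based length therefore reports 4 for this string, while a code-point-based length reports 1.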
