[SERVER-22580] Add $cpLength and $cpSubstr expressions which work via code points Created: 11/Feb/16 Updated: 14/Mar/17 Resolved: 25/Mar/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | 3.3.4 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Benjamin Murphy |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Query 12 (04/04/16) |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
Syntax
Examples
Input
Pipeline
Output
Additional Notes
Original Description
The current expression $substr, and the proposed expression $length (see For example, {$substr: ["\uD834\uDF06", 0, 1]} would be an error (since the second is a continuation byte), but {$cpSubstr: ["\uD834\uDF06", 0, 1]} would be "\uD834\uDF06". Correspondingly, {$length: "\uD834\uDF06"} would be 2, but {$cpLength: "\uD834\uDF06"} would be 1. |
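The difference between the two substring semantics can be sketched in Python. This is an illustration only, not the server's implementation: `substr_bytes` and `substr_code_points` are hypothetical helper names, and the sketch assumes the string is stored as UTF-8. Python writes the non-BMP character as `"\U0001D306"` rather than as the surrogate pair `"\uD834\uDF06"`.

```python
# Hypothetical helpers illustrating the two substring semantics.
# They are not MongoDB's implementation.

def substr_bytes(s: str, start: int, length: int) -> str:
    """Slice the UTF-8 encoding of s by byte offsets."""
    data = s.encode("utf-8")
    # Decoding raises UnicodeDecodeError if the slice splits a code point.
    return data[start:start + length].decode("utf-8")


def substr_code_points(s: str, start: int, length: int) -> str:
    """Slice s by code point offsets (a Python str indexes code points)."""
    return s[start:start + length]


tetragram = "\U0001D306"  # one non-BMP code point, four UTF-8 bytes

# Slicing by code points returns the whole character.
print(substr_code_points(tetragram, 0, 1) == tetragram)

# Slicing one byte cuts the character apart, so decoding fails.
try:
    substr_bytes(tetragram, 0, 1)
except UnicodeDecodeError:
    print("byte slice split a code point")
```

A byte-based slice only succeeds when both ends of the range fall on code point boundaries, which is why a code-point-based expression is easier to use safely on non-ASCII text.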
| Comments |
| Comment by Githook User [ 25/Mar/16 ] |
|
Author: {u'username': u'benjaminmurphy', u'name': u'Benjamin Murphy', u'email': u'benjamin_murphy@me.com'}
Message: |
| Comment by Benjamin Murphy [ 25/Mar/16 ] |
|
This ticket introduced the $substrCP expression, with syntax as described in the description. It needs documentation, and any drivers that provide aggregation framework helpers should be updated to include this new expression. In addition, this patch deprecated $substr in favor of $substrBytes, which has identical functionality. |
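As a rough sketch of how the two expressions named in this comment might appear side by side, the stage below is built as plain Python data and does not need a driver or a running server. The field name `title` and the output names `firstCodePoint` and `firstByte` are hypothetical, and the `[string, start, length]` argument order is assumed from the `$substr` examples in the description.

```python
# A $project stage as plain Python data; "title" is a hypothetical field.
project_stage = {
    "$project": {
        # First code point of the string, however many bytes it occupies.
        "firstCodePoint": {"$substrCP": ["$title", 0, 1]},
        # First byte of the string; per the description, a range that
        # splits a multi-byte character is expected to be an error.
        "firstByte": {"$substrBytes": ["$title", 0, 1]},
    }
}

pipeline = [project_stage]
print(pipeline)
```

With a driver, a list like `pipeline` would be passed to the collection's aggregate call.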
| Comment by Githook User [ 25/Mar/16 ] |
|
Author: {u'username': u'benjaminmurphy', u'name': u'Benjamin Murphy', u'email': u'benjamin_murphy@me.com'}
Message: |
| Comment by Charlie Swanson [ 11/Mar/16 ] |
|
After some internal discussion, I've updated the description to match the agreed-upon design. |
| Comment by Mathias Stearn [ 16/Feb/16 ] |
|
Your example is a bit confusing because it uses UTF-16 surrogate pairs. It may be simpler to rewrite it using either the UTF-8 bytes or UTF-32 whole code points. I think your $length example would actually return 4, since that is the number of UTF-8 bytes it takes to represent that single non-BMP code point (see http://www.charbase.com/1d306-unicode-tetragram-for-centre). |
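The counts discussed in this comment can be checked with a short Python sketch. It only measures the character U+1D306 in each encoding and makes no claim about how the server computes lengths.

```python
tetragram = "\U0001D306"  # U+1D306 TETRAGRAM FOR CENTRE

code_points = len(tetragram)                  # a Python str counts code points
utf8_bytes = len(tetragram.encode("utf-8"))   # bytes in the UTF-8 encoding
# Each UTF-16 code unit is 2 bytes; each UTF-32 code unit is 4 bytes.
utf16_units = len(tetragram.encode("utf-16-le")) // 2
utf32_units = len(tetragram.encode("utf-32-le")) // 4

print(code_points, utf8_bytes, utf16_units, utf32_units)  # 1 4 2 1
```

The value 2 in the original description matches the UTF-16 surrogate pair, while a byte-based length over UTF-8 data gives 4.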