[SERVER-6801] aggregation $substr expression can output invalid UTF8 Created: 20/Aug/12  Updated: 28/Oct/15  Resolved: 22/May/15

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 3.1.4

Type: Bug
Priority: Major - P3
Reporter: Aaron Staple
Assignee: Charlie Swanson
Resolution: Done
Votes: 1
Labels: UT, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Major Change
Operating System: ALL
Sprint: Quint Iteration 4
Participants:

 Description   

$substr will be changed to return an error if the requested substring would split a multi-byte UTF-8 code point. Specifically, it will error if the first byte of the result is a continuation byte, or if the last byte is neither a single-byte code point nor the final byte of a multi-byte code point. The implementation may assume that the input string is valid UTF-8.
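The two boundary conditions above can be sketched in plain JavaScript. This is only an illustration, not the server's implementation: substrBytesChecked and isContinuationByte are hypothetical names, and the treatment of a negative length as "to the end of the string" is an assumption. Because the input is assumed to be valid UTF-8, "the last byte ends a code point" is equivalent to "the byte immediately after the range is not a continuation byte".

```javascript
// Hypothetical helper: a UTF-8 continuation byte has the bit pattern 10xxxxxx.
function isContinuationByte(b) {
  return (b & 0xC0) === 0x80;
}

// Hypothetical sketch of a byte-based substring that refuses to split a
// code point. Assumes `str` encodes to valid UTF-8.
function substrBytesChecked(str, start, length) {
  const bytes = new TextEncoder().encode(str);
  const begin = Math.min(start, bytes.length);
  // Assumption: a negative length means "through the end of the string".
  const end = length < 0 ? bytes.length : Math.min(begin + length, bytes.length);

  // Condition 1: the first byte of the result must not be a continuation byte.
  if (begin < bytes.length && isContinuationByte(bytes[begin])) {
    throw new Error("starting index is a UTF-8 continuation byte");
  }
  // Condition 2: the range must end on a code point boundary, i.e. the next
  // byte (if any) must not be a continuation byte.
  if (end < bytes.length && isContinuationByte(bytes[end])) {
    throw new Error("ending index is in the middle of a UTF-8 character");
  }
  return new TextDecoder().decode(bytes.subarray(begin, end));
}
```

With this sketch, substrBytesChecked("a\u00e9", 1, 2) returns the two-byte character on its own, while substrBytesChecked("\u00e9", 0, 1) throws because the range would end after the lead byte.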

Original Title: aggregation string functions are not encoding (utf8) aware

Original Description:
The aggregation string manipulation functions are not UTF-8 aware. For example, the offset and length parameters passed to $substr count bytes, not characters. Some parameter choices can therefore produce invalid UTF-8 in the aggregation result and cause the shell to print messages about invalid UTF-8.

We might want to prevent the aggregation framework from producing invalid utf8. Potentially we could make $substr operate on utf8 characters rather than bytes.
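To illustrate why a byte-based offset and length can produce invalid UTF-8, the following sketch runs outside the server in plain JavaScript (for example, Node.js). The names utf8Bytes, firstByteOnly, and isValidUtf8 are made up for this example. It encodes '\u0080', the same character used in the test below, and shows that keeping only its first byte leaves an incomplete sequence.

```javascript
// U+0080 needs two bytes in UTF-8: a lead byte followed by a continuation byte.
const utf8Bytes = new TextEncoder().encode("\u0080");

// Taking "1 byte starting at offset 0" keeps only the lead byte.
const firstByteOnly = utf8Bytes.subarray(0, 1);

// Hypothetical helper: a strict decoder throws on malformed input,
// so a failed decode means the bytes are not valid UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(utf8Bytes.length);           // byte length, not character count
console.log(isValidUtf8(utf8Bytes));     // the full encoding
console.log(isValidUtf8(firstByteOnly)); // the truncated encoding
```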

Test

c = db.c;
c.drop();
c.save( {} );
 
result = c.aggregate( { $project:{ _id:0, x:{ $substr:[ '\u0080', 0, 1 ] } } } );
printjson( result );

Output

Aaron-Staples-MacBook-Pro:mongo3 aaron$ ./mongo test.js
MongoDB shell version: 2.3.0-pre-
connecting to: test
Sun Aug 19 23:07:41 decode failed. probably invalid utf-8 string [?]
Sun Aug 19 23:07:41 	 why: InternalError: buffer too small
Sun Aug 19 23:07:41 InternalError: buffer too small src/mongo/shell/utils.js:1018
failed to load: test.js



 Comments   
Comment by Githook User [ 22/May/15 ]

Author: Charlie Swanson (cswanson310) <charlie.swanson@mongodb.com>

Message: SERVER-6801: Error when aggregation's $substr expression results in invalid UTF-8
Branch: master
https://github.com/mongodb/mongo/commit/bc45142fd9a2739484fa586ff7263318e63059b8

Comment by Mathias Stearn [ 01/Apr/13 ]

This should be easy for a neweng once we add some standard UTF8 helper functions.

Comment by auto [ 20/Aug/12 ]

Author: astaple <aaron@10gen.com>
Date: 2012-08-20T13:48:53-07:00

Message: SERVER-6801 Document lack of character encoding support in $substr.
Branch: master
https://github.com/mongodb/docs/commit/c603096fd272bfbef239ec1e8f6171323249dd18
