  1. Core Server
  2. SERVER-6801

aggregation $substr expression can output invalid UTF8

    • Major Change
    • ALL
    • Quint Iteration 4

      $substr will be changed to return an error if it would split a string in the middle of a multi-byte code point. Specifically, it will error out if the first byte of the result is a continuation byte, or if the last byte is neither a single-byte code point nor the final byte of a multi-byte code point. The implementation may assume that the input string is valid UTF-8.
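The boundary rule above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not the server's implementation: `substrBytesChecked` and `isContinuationByte` are hypothetical names, offsets are assumed to be non-negative integers, and the input is assumed to be valid UTF-8 (as the change description allows). Under that assumption, the last byte of the slice ends a code point exactly when the byte following the slice is not a continuation byte.

```javascript
// Hypothetical helpers for illustration; not part of MongoDB.

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
function isContinuationByte(b) {
  return (b & 0xC0) === 0x80;
}

// Byte-based substring that refuses to split a multi-byte code point.
// Assumes start and length are non-negative integers.
function substrBytesChecked(str, start, length) {
  if (!Number.isInteger(start) || !Number.isInteger(length) || start < 0 || length < 0) {
    throw new RangeError('start and length must be non-negative integers');
  }
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes of the input
  const begin = Math.min(start, bytes.length);
  const end = Math.min(begin + length, bytes.length);

  // The slice must not begin in the middle of a code point.
  if (begin < bytes.length && isContinuationByte(bytes[begin])) {
    throw new Error('substring starts in the middle of a UTF-8 code point');
  }
  // The slice must not end in the middle of a code point: for valid UTF-8
  // input this holds exactly when the next byte is not a continuation byte.
  if (end < bytes.length && isContinuationByte(bytes[end])) {
    throw new Error('substring ends in the middle of a UTF-8 code point');
  }
  return new TextDecoder().decode(bytes.subarray(begin, end));
}
```

With this sketch, `substrBytesChecked('\u0080', 0, 1)` throws, because U+0080 is encoded as two bytes and the slice would keep only the first, while `substrBytesChecked('\u0080', 0, 2)` returns the whole character.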

      Original Title: aggregation string functions are not encoding (utf8) aware

      Original Description:
      The aggregation string manipulation functions are not UTF-8 aware. For example, the offset and length parameters passed to $substr are measured in bytes, not characters. Some parameter choices can therefore generate invalid UTF-8 in the aggregation result and cause the shell to print messages about invalid UTF-8.

      We might want to prevent the aggregation framework from producing invalid utf8. Potentially we could make $substr operate on utf8 characters rather than bytes.
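The difference between byte offsets and character offsets can be illustrated with a short sketch. The names `utf8ByteLength` and `substrCodePoints` are hypothetical and used only for this example; the code shows one way a substring could count code points instead of bytes, and is not a proposal for how the server should implement it.

```javascript
// Hypothetical helpers for illustration; not part of MongoDB.

// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// Substring measured in Unicode code points rather than bytes.
// Array.from iterates a string by code point, so a multi-byte
// character is never split.
function substrCodePoints(str, start, length) {
  return Array.from(str).slice(start, start + length).join('');
}

// U+0080 is one code point but two UTF-8 bytes, which is why a
// one-byte $substr of it produces invalid UTF-8.
const s = 'a\u0080b';
const byteCount = utf8ByteLength(s);          // 4 bytes: 1 + 2 + 1
const codePointCount = Array.from(s).length;  // 3 code points
const middle = substrCodePoints(s, 1, 1);     // the whole U+0080 character
```

A code-point-based variant always returns valid UTF-8 for valid input, at the cost of scanning the string to locate offsets instead of indexing bytes directly.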

      Test

      c = db.c;
      c.drop();
      c.save( {} );
      
      result = c.aggregate( { $project:{ _id:0, x:{ $substr:[ '\u0080', 0, 1 ] } } } );
      printjson( result );
      

      Output

      Aaron-Staples-MacBook-Pro:mongo3 aaron$ ./mongo test.js
      MongoDB shell version: 2.3.0-pre-
      connecting to: test
      Sun Aug 19 23:07:41 decode failed. probably invalid utf-8 string [?]
      Sun Aug 19 23:07:41 	 why: InternalError: buffer too small
      Sun Aug 19 23:07:41 InternalError: buffer too small src/mongo/shell/utils.js:1018
      failed to load: test.js
      

            Assignee:
            charlie.swanson@mongodb.com Charlie Swanson
            Reporter:
            aaron Aaron Staple
            Votes:
            1
            Watchers:
            7
