[SERVER-325] Correctly handle \u escapes using UTF16 surrogate pairs for chars outside of BMP Created: 30/Sep/09  Updated: 06/Dec/22  Resolved: 14/Mar/22

Status: Closed
Project: Core Server
Component/s: Tools
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor - P4
Reporter: Mathias Stearn
Assignee: Backlog - Storage Execution Team
Resolution: Won't Do
Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Participants:

 Description   

http://en.wikipedia.org/wiki/UTF-16/UCS-2#Encoding_of_characters_outside_the_BMP

From RFC 4627, section 2.5:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
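
For reference, the arithmetic that turns a surrogate pair back into a single code point can be sketched as follows. This is only an illustration of the UTF-16 rule quoted above, not the server's parser; the function name combine_surrogates is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point.

    Hypothetical helper for illustration; not taken from the MongoDB source.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 payload bits; the pair is offset by 0x10000
    # because the BMP itself (U+0000..U+FFFF) needs no surrogates.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The G clef example from the RFC text: \uD834\uDD1E
g_clef = combine_surrogates(0xD834, 0xDD1E)
print(f"U+{g_clef:X}")  # U+1D11E
```

A parser that sees a \u escape in the high-surrogate range would need to read the following \u escape and apply this combination before emitting any bytes.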



 Comments   
Comment by Mathias Stearn [ 02/May/16 ]

The repro is in the description: "\uD834\uDD1E". That is (unfortunately) the correct way to encode U+1D11E in JSON. Do we parse it correctly as one 4-byte UTF-8 character, or incorrectly as two 3-byte sequences, one per surrogate? Judging by this code, we still handle it incorrectly:

https://github.com/mongodb/mongo/blob/177d955e1e47535141edfb6205ba8fd7004b2aa2/src/mongo/bson/json.cpp#L1144-L1161
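
To make the two outcomes concrete, here is a small sketch of the byte sequences in question. It uses Python's standard json module as a reference decoder and does not run or reproduce the linked json.cpp code; the variable names are illustrative only.

```python
import json

# Reference behaviour: a conforming JSON decoder joins the surrogate pair
# into a single code point, which UTF-8 encodes as 4 bytes.
parsed = json.loads('"\\uD834\\uDD1E"')
correct = parsed.encode("utf-8")
print(len(parsed), correct.hex())  # 1 f09d849e

# The suspected incorrect outcome: each \u escape is converted to UTF-8 on
# its own, giving two 3-byte sequences. "surrogatepass" is needed here only
# because Python otherwise refuses to encode lone surrogates.
wrong = b"".join(
    chr(unit).encode("utf-8", "surrogatepass") for unit in (0xD834, 0xDD1E)
)
print(len(wrong), wrong.hex())  # 6 eda0b4edb49e

# The 6-byte form is not valid UTF-8, so a strict decoder rejects it.
try:
    wrong.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```

A check along these lines (expecting f0 9d 84 9e for the repro string) would distinguish the two behaviours if run against the server's own parser.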
