[CDRIVER-2403] Does libbson implement UTF-8 or CESU-8? Created: 22/Nov/17 Updated: 28/Oct/23 Resolved: 26/Dec/17 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | libbson |
| Affects Version/s: | None |
| Fix Version/s: | 1.10.0 |
| Type: | Task | Priority: | Minor - P4 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | Unassigned |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Description |
|
In _bson_utf8_get_sequence we allow character lengths up to 6 bytes, which suggests we're parsing the CESU-8 encoding, not strict UTF-8. Figure out whether this is true and whether we did it correctly. If so, document it. |
| Comments |
| Comment by Githook User [ 14/Dec/17 ] |
|
Author: {'name': 'Xiangyu Yao', 'email': 'xiangyu.yao24@gmail.com', 'username': 'xy24'}Message: |
| Comment by Xiangyu Yao (Inactive) [ 07/Dec/17 ] |
|
As Wikipedia points out, I am going to change our implementation to the 4-byte standard, in case the 6-byte form is incompatible with other drivers or the MongoDB server. |
| Comment by Xiangyu Yao (Inactive) [ 07/Dec/17 ] |
|
After communicating with the original author, Christian Hergert, and doing some experiments of my own, I realized that bson-utf8.c does implement the UTF-8 encoding rather than CESU-8. On the Wikipedia page for UTF-8, the diagram in the 'Description' section shows a maximum encoded size of 4 bytes, while a diagram in the 'History' section (FSS-UTF (1992) / UTF-8 (1993)) shows that an encoded character can be up to 6 bytes. Here we simply chose the 6-byte standard. The only difference is that the maximum supported code point rises from 0x10FFFF to 0x7FFFFFFF. |
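
To make the 4-byte versus 6-byte distinction concrete, here is a minimal sketch of how a lead byte determines the sequence length under the original 1993 scheme, and how many code point bits each length can carry. The helper names `lead_byte_length` and `max_code_point` are hypothetical and written for illustration only; they are not libbson functions and do not reproduce the code in bson-utf8.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not libbson API): sequence length implied by a lead
 * byte under the 1993 scheme, which allows up to 6 bytes.
 * Returns 0 if the byte cannot start a sequence. */
static int
lead_byte_length (uint8_t c)
{
   if ((c & 0x80) == 0x00) return 1; /* 0xxxxxxx */
   if ((c & 0xE0) == 0xC0) return 2; /* 110xxxxx */
   if ((c & 0xF0) == 0xE0) return 3; /* 1110xxxx */
   if ((c & 0xF8) == 0xF0) return 4; /* 11110xxx */
   if ((c & 0xFC) == 0xF8) return 5; /* 111110xx */
   if ((c & 0xFE) == 0xFC) return 6; /* 1111110x */
   return 0; /* continuation byte (10xxxxxx), 0xFE, or 0xFF */
}

/* Hypothetical helper: largest value that fits in a sequence of len bytes. */
static uint32_t
max_code_point (int len)
{
   int bits;

   assert (len >= 1 && len <= 6);
   if (len == 1) {
      return 0x7F; /* 7 payload bits */
   }
   /* The lead byte carries (7 - len) bits; each continuation byte carries 6. */
   bits = (7 - len) + 6 * (len - 1);
   return (uint32_t) ((UINT64_C (1) << bits) - 1);
}
```

Under these assumptions, `max_code_point (6)` is 0x7FFFFFFF, matching the upper bound quoted above. `max_code_point (4)` is 0x1FFFFF, the raw capacity of a 4-byte sequence; the 0x10FFFF limit is an additional cap imposed by the current UTF-8 definition (RFC 3629), so a strict decoder must also range-check the decoded value.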