[CDRIVER-2403] Does libbson implement UTF-8 or CESU-8? Created: 22/Nov/17  Updated: 28/Oct/23  Resolved: 26/Dec/17

Status: Closed
Project: C Driver
Component/s: libbson
Affects Version/s: None
Fix Version/s: 1.10.0

Type: Task Priority: Minor - P4
Reporter: A. Jesse Jiryu Davis Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to CDRIVER-2401 Handle UTF-8 multibyte NIL in bson_ut... Closed
Backwards Compatibility: Fully Compatible

 Description   

In _bson_utf8_get_sequence we allow character lengths up to 6 bytes, which suggests we're parsing the CESU-8 encoding, not strict UTF-8. Figure out if this is true and if we did it correctly. If so, document it.
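For context on why a 6-byte length hints at CESU-8: CESU-8 encodes a code point above U+FFFF as a UTF-16 surrogate pair, with each surrogate written as its own 3-byte sequence, while RFC 3629 UTF-8 writes it as a single 4-byte sequence. The sketch below illustrates the difference. The function names utf8_encode and cesu8_encode are hypothetical and are not part of libbson.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libbson API): encode one code point as
 * RFC 3629 UTF-8. Returns the number of bytes written (1-4), or 0 if the
 * code point is a surrogate or lies above U+10FFFF. */
size_t
utf8_encode (uint32_t cp, uint8_t *out)
{
   if (cp < 0x80) {
      out[0] = (uint8_t) cp;
      return 1;
   }
   if (cp < 0x800) {
      out[0] = (uint8_t) (0xC0 | (cp >> 6));
      out[1] = (uint8_t) (0x80 | (cp & 0x3F));
      return 2;
   }
   if (cp < 0x10000) {
      if (cp >= 0xD800 && cp <= 0xDFFF) {
         return 0; /* surrogates are not valid scalar values */
      }
      out[0] = (uint8_t) (0xE0 | (cp >> 12));
      out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (uint8_t) (0x80 | (cp & 0x3F));
      return 3;
   }
   if (cp <= 0x10FFFF) {
      out[0] = (uint8_t) (0xF0 | (cp >> 18));
      out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (uint8_t) (0x80 | (cp & 0x3F));
      return 4;
   }
   return 0;
}

/* Hypothetical helper (not libbson API): encode one code point as CESU-8.
 * Code points up to U+FFFF match UTF-8; higher ones become a surrogate
 * pair, each half written as a 3-byte sequence (6 bytes in total). */
size_t
cesu8_encode (uint32_t cp, uint8_t *out)
{
   uint32_t v, pair[2];
   int i;

   if (cp < 0x10000) {
      return utf8_encode (cp, out);
   }
   if (cp > 0x10FFFF) {
      return 0;
   }
   v = cp - 0x10000;
   pair[0] = 0xD800 + (v >> 10);   /* high surrogate */
   pair[1] = 0xDC00 + (v & 0x3FF); /* low surrogate */
   for (i = 0; i < 2; i++) {
      out[3 * i + 0] = (uint8_t) (0xE0 | (pair[i] >> 12));
      out[3 * i + 1] = (uint8_t) (0x80 | ((pair[i] >> 6) & 0x3F));
      out[3 * i + 2] = (uint8_t) (0x80 | (pair[i] & 0x3F));
   }
   return 6;
}
```

With these helpers, U+1F600 comes out as F0 9F 98 80 in UTF-8 and as ED A0 BD ED B8 80 in CESU-8. Note that a CESU-8 "6-byte character" is two 3-byte sequences, not a single sequence with a 6-byte lead byte.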



 Comments   
Comment by Githook User [ 14/Dec/17 ]

Author: Xiangyu Yao <xiangyu.yao24@gmail.com> (xy24)

Message: CDRIVER-2403 all bson-utf8 relevant functions conform to RFC-3629 now
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/c9b20923296d28203706a1ef0134b341b860aab5

Comment by Xiangyu Yao (Inactive) [ 07/Dec/17 ]

As Wikipedia points out:

"In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences."

I am going to change our implementation to the 4-byte standard, in case the current 6-byte implementation is incompatible with other drivers or the MongoDB server.
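The RFC 3629 restrictions quoted above can be sketched as a strict validator. This is a minimal illustration, not the code from the commit: the name rfc3629_is_valid is hypothetical, and unlike libbson it makes no special allowance for the overlong NIL encoding discussed in CDRIVER-2401.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libbson API): return true if the buffer holds
 * only well-formed RFC 3629 UTF-8 sequences. */
bool
rfc3629_is_valid (const uint8_t *s, size_t len)
{
   size_t i = 0;

   while (i < len) {
      uint8_t b = s[i];
      size_t n, k;  /* n: total bytes in this sequence */
      uint32_t cp;  /* decoded code point */
      uint32_t min; /* smallest code point that needs n bytes */

      if (b < 0x80) {
         i++;
         continue;
      } else if ((b & 0xE0) == 0xC0) {
         n = 2; cp = b & 0x1F; min = 0x80;
      } else if ((b & 0xF0) == 0xE0) {
         n = 3; cp = b & 0x0F; min = 0x800;
      } else if ((b & 0xF8) == 0xF0) {
         n = 4; cp = b & 0x07; min = 0x10000;
      } else {
         /* stray continuation byte, 5- or 6-byte lead, or 0xFE/0xFF */
         return false;
      }
      if (len - i < n) {
         return false; /* truncated sequence */
      }
      for (k = 1; k < n; k++) {
         if ((s[i + k] & 0xC0) != 0x80) {
            return false; /* not a continuation byte */
         }
         cp = (cp << 6) | (uint32_t) (s[i + k] & 0x3F);
      }
      if (cp < min) {
         return false; /* overlong encoding */
      }
      if (cp >= 0xD800 && cp <= 0xDFFF) {
         return false; /* UTF-16 surrogate */
      }
      if (cp > 0x10FFFF) {
         return false; /* beyond the Unicode range */
      }
      i += n;
   }
   return true;
}
```

Under this sketch, F0 9F 98 80 (U+1F600) is accepted, while ED A0 80 (the surrogate U+D800), F4 90 80 80 (U+110000), and any sequence starting with a 5-byte lead such as F8 are rejected.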

Comment by Xiangyu Yao (Inactive) [ 07/Dec/17 ]

After communicating with the original author, Christian Hergert, and doing some experiments of my own, I realized that bson-utf8.c does implement the UTF-8 encoding rather than CESU-8.

On the Wikipedia page for UTF-8, the diagram in the 'Description' section shows a maximum encoded size of 4 bytes, while a diagram in the 'History' section shows that the original encoding (FSS-UTF (1992) / UTF-8 (1993)) can be up to 6 bytes. We simply chose the 6-byte standard. The only difference is that the maximum supported code point rises from 0x10FFFF to 0x7FFFFFFF.
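To make the 6-byte scheme concrete, here is a minimal sketch of how the 1993 layout maps a lead byte to a sequence length, and how many code points each length can reach. Both function names are hypothetical; this is not the body of _bson_utf8_get_sequence.

```c
#include <stdint.h>

/* Hypothetical helper (not libbson API): sequence length implied by a
 * lead byte under the original FSS-UTF / UTF-8 (1993) layout.
 * Returns 0 for a continuation byte or for 0xFE/0xFF. */
unsigned
fss_utf_sequence_length (uint8_t lead)
{
   if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
   if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
   if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
   if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
   if ((lead & 0xFC) == 0xF8) return 5; /* 111110xx */
   if ((lead & 0xFE) == 0xFC) return 6; /* 1111110x */
   return 0;
}

/* Hypothetical helper: largest code point an n-byte sequence can carry,
 * for n in 1..6. A 1-byte sequence has 7 payload bits; an n-byte sequence
 * has (7 - n) bits in the lead byte plus 6 bits per continuation byte. */
uint32_t
fss_utf_max_code_point (unsigned n)
{
   unsigned bits = (n == 1) ? 7 : (7 - n) + 6 * (n - 1);
   return (UINT32_C (1) << bits) - 1;
}
```

A 6-byte sequence carries 31 payload bits, which gives the 0x7FFFFFFF ceiling mentioned above. A 4-byte sequence can carry up to 0x1FFFFF, so RFC 3629 also has to cap 4-byte values at 0x10FFFF rather than only dropping the 5- and 6-byte forms.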

Generated at Wed Feb 07 21:15:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.