Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: bson-4.6.0
Affects Version/s: None
Component/s: BSON
Labels: None
Epic Link: Reliability Improvements
Confidence Status: None
Backwards Compatibility: Minor Change
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Link: None
Goal Name(s): None

Currently the bson extension behaves differently on MRI and on JRuby when put_string is given anything other than a valid UTF-8 byte sequence in a string labeled with UTF-8 encoding:

In MRI, it is possible to construct strings labeled with UTF-8 encoding which are not valid UTF-8. These strings are rejected by the bson C extension.
In JRuby, it is not possible to construct strings labeled with UTF-8 encoding which are not valid UTF-8.
In MRI, writing a string which is not labeled with UTF-8 encoding, but which happens to contain valid UTF-8 sequences, treats the string as if it were labeled with UTF-8 encoding and writes it verbatim to the bson buffer.
In JRuby, writing a string which is not labeled with UTF-8 encoding first converts it to UTF-8, seemingly treating each byte in the original string as a code point. This silently mutates the input during serialization.
In MRI, when a string is given in an encoding other than UTF-8, writing it to the bson buffer yields an error saying the string is not valid UTF-8, even if the byte sequence is valid in the claimed encoding and the string can be encoded to UTF-8.
In JRuby, the same input is encoded to UTF-8 and written to the bson buffer.
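The three kinds of input discussed above can be constructed in plain Ruby on MRI. The following is only an illustration of the strings themselves. It makes no bson calls, and the variable names are invented for this example.

```ruby
# 1. Labeled UTF-8, but the bytes are not valid UTF-8.
invalid_utf8 = [0xFF, 0xFE].pack("C*").force_encoding(Encoding::UTF_8)

# 2. Valid UTF-8 bytes ("café"), but labeled binary (ASCII-8BIT).
#    String#pack returns a binary-labeled string.
mislabeled = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")

# 3. Valid ISO-8859-1 data ("café") that can be converted to UTF-8.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)

puts invalid_utf8.valid_encoding?            # false: label and bytes disagree
puts mislabeled.encoding                     # ASCII-8BIT
puts latin1.encode(Encoding::UTF_8).bytes.inspect
```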

Proposed new behavior:

Strings which are not already in UTF-8 encoding are first converted to UTF-8. The conversion can fail, in which case Encoding::UndefinedConversionError propagates to the application.
Then the string is checked for valid UTF-8 byte sequences.
Finally, the string is written to the byte buffer.
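The three steps above can be sketched in plain Ruby as follows. This is not the bson-ruby implementation: the helper name prepare_string_for_bson is hypothetical, the use of ArgumentError for invalid UTF-8 is an assumption, and the helper returns the bytes instead of writing to a real byte buffer.

```ruby
# Hypothetical helper illustrating the proposed order of operations.
def prepare_string_for_bson(str)
  # Step 1: convert to UTF-8 unless the string is already labeled UTF-8.
  # String#encode raises Encoding::UndefinedConversionError (or
  # Encoding::InvalidByteSequenceError) when the conversion is impossible.
  utf8 = str.encoding == Encoding::UTF_8 ? str : str.encode(Encoding::UTF_8)

  # Step 2: reject byte sequences that are not valid UTF-8.
  # ArgumentError is an assumption; the real error class may differ.
  raise ArgumentError, "string is not valid UTF-8" unless utf8.valid_encoding?

  # Step 3: return the bytes that would be written to the buffer.
  utf8.b
end

# Valid ISO-8859-1 input ("café"): converted to UTF-8, then accepted.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)

# UTF-8 bytes carrying a binary label: conversion fails in step 1.
mislabeled = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")

# UTF-8 label on invalid bytes: rejected in step 2.
invalid = [0xFF, 0xFE].pack("C*").force_encoding(Encoding::UTF_8)

puts prepare_string_for_bson(latin1).bytes.inspect
```

In this sketch a mislabeled binary string fails during conversion, because bytes above 0x7F have no defined mapping from ASCII-8BIT to UTF-8, while a string labeled UTF-8 skips conversion and is only validated.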

For MRI, this change means strings in non-UTF-8 encodings which contain valid data will be serialized after conversion to UTF-8. Applications giving mislabeled strings to the driver (i.e. strings that hold UTF-8 data but are not marked as having UTF-8 encoding) will need to set the encoding correctly.
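One way an application could set the encoding correctly is to relabel the string before handing it to the driver. This is a minimal sketch using only core String methods; the variable names are illustrative, and whether relabeling is appropriate depends on the application actually knowing the bytes are UTF-8.

```ruby
# Bytes obtained from some external source; pack returns a binary-labeled
# (ASCII-8BIT) string even though the bytes spell "café" in UTF-8.
raw = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")

# force_encoding changes only the label, not the bytes. It mutates its
# receiver, so work on a copy to leave the original untouched.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)

# Confirm the label is now truthful before serializing.
raise ArgumentError, "data is not valid UTF-8" unless relabeled.valid_encoding?

puts relabeled.encoding
```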

For JRuby, this change means the bson extension will no longer mutate data that is not valid UTF-8. Some data that was previously mutated silently will now be rejected.

is related to

RUBY-1977 Document and repair edge cases in ByteBuffer

Closed

links to

bson-ruby #139: RUBY-1978 Consistently handle data not in UTF-8 when writing strings/symbols

Assignee: Oleg Pudeyev (Inactive)
Reporter: Oleg Pudeyev (Inactive)
Votes: 0
Watchers: 1

Created: Oct 23 2019 07:09:58 PM UTC
Updated: Oct 28 2023 11:12:50 AM UTC
Resolved: Oct 24 2019 07:07:05 PM UTC
