Ruby Driver / RUBY-1978

Consistently handle data not in UTF-8 when writing strings/symbols

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: bson-4.6.0
    • Affects Version/s: None
    • Component/s: BSON
    • Labels: None
    • Minor Change

      Currently, the BSON extension behaves inconsistently across MRI and JRuby when put_string is given anything other than a valid UTF-8 byte sequence in a string labeled with UTF-8 encoding:

      • In MRI, it is possible to construct strings labeled with UTF-8 encoding whose bytes are not valid UTF-8. These strings are rejected by the BSON C extension.
      • In JRuby, it is not possible to construct strings labeled with UTF-8 encoding whose bytes are not valid UTF-8.
      • In MRI, writing a string which is not labeled with UTF-8 encoding, but which happens to contain valid UTF-8 sequences, treats the string as if it were labeled with UTF-8 encoding and writes it verbatim to the BSON buffer.
      • In JRuby, writing a string which is not labeled with UTF-8 encoding first converts it to UTF-8, seemingly treating each byte in the original string as a code point. This silently mutates the input during serialization.
      • In MRI, when a string is given in an encoding other than UTF-8, writing it to the BSON buffer yields an error saying the string is not valid UTF-8, even if the byte sequence is valid in the claimed encoding and the string can be encoded to UTF-8.
      • In JRuby, the same input is encoded to UTF-8 and written to the BSON buffer.
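
      The input categories above can be reproduced in plain Ruby without the bson gem. This is an illustrative sketch only: the variable names are hypothetical, the first case applies to MRI, and the pack("U*") line merely simulates the byte-as-code-point conversion described for JRuby rather than reproducing JRuby's actual code.

```ruby
# Case 1 (MRI): labeled UTF-8, but the bytes are not valid UTF-8.
invalid_utf8 = "\xFF\xFE".dup.force_encoding(Encoding::UTF_8)
puts invalid_utf8.valid_encoding?

# Case 2: valid UTF-8 bytes, but labeled as binary (a "mislabeled" string).
mislabeled = "caf\u00e9".b
puts mislabeled.encoding
puts mislabeled.dup.force_encoding(Encoding::UTF_8).valid_encoding?

# Simulation (assumption) of treating each byte as a code point:
# the two bytes of the accented character become two separate characters.
mutated = mislabeled.bytes.pack("U*")
puts mutated.inspect

# Case 3: bytes valid in a non-UTF-8 encoding and convertible to UTF-8.
latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
converted = latin1.encode(Encoding::UTF_8)
puts converted.encoding
```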

      Proposed new behavior:

      • Strings which are not already in UTF-8 encoding are first converted to UTF-8. This conversion can fail, propagating Encoding::UndefinedConversionError to the application.
      • Then, the string is checked to contain only valid UTF-8 byte sequences.
      • Finally, the string is written to the byte buffer.
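
      The three steps above can be sketched in plain Ruby as follows. This is a minimal sketch, not the extension's implementation: utf8_bytes_for_bson is a hypothetical helper name, raising EncodingError for invalid data is an assumption of this sketch, and returning the bytes stands in for writing to the byte buffer.

```ruby
# Hypothetical helper, not part of the bson gem's API.
def utf8_bytes_for_bson(value)
  str = value.to_s # also covers symbols

  # Step 1: convert to UTF-8 unless the string is already labeled UTF-8.
  # String#encode may raise Encoding::UndefinedConversionError here.
  str = str.encode(Encoding::UTF_8) unless str.encoding == Encoding::UTF_8

  # Step 2: check that the bytes form valid UTF-8.
  # The exception class used here is an assumption of this sketch.
  unless str.valid_encoding?
    raise EncodingError, "String #{str.inspect} is not valid UTF-8"
  end

  # Step 3: hand back the raw bytes that would go into the byte buffer.
  str.b
end

latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
puts utf8_bytes_for_bson(latin1).bytes.inspect
puts utf8_bytes_for_bson(:name).inspect
```

      With this ordering, a string in another encoding that holds valid data is converted rather than rejected, while a string labeled UTF-8 with invalid bytes skips conversion and fails the validity check.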

      For MRI, this change means that strings in non-UTF-8 encodings which contain valid data will be serialized after conversion to UTF-8. Applications giving mislabeled strings to the driver (i.e. UTF-8 data in a string not marked as having UTF-8 encoding) will need to set the encoding correctly.
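
      As a hedged illustration of what setting the encoding correctly might look like in application code, the sketch below relabels binary data that is known to hold UTF-8 bytes. The variable names are hypothetical, and the failing String#encode call only illustrates why a conversion step would reject such input.

```ruby
# UTF-8 bytes that arrived labeled as binary, e.g. from a binary-mode read.
raw = "caf\u00e9".b

# Converting the mislabeled string fails, since binary high bytes
# have no defined mapping to UTF-8.
conversion_failed = false
begin
  raw.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError
  conversion_failed = true
end
puts conversion_failed

# Relabeling changes only the encoding tag; the bytes stay untouched.
fixed = raw.dup.force_encoding(Encoding::UTF_8)
puts fixed.valid_encoding?
puts fixed.bytes == raw.bytes
```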

      For JRuby, this change means the BSON extension will no longer mutate data that is not valid UTF-8. Some data that was previously silently mutated will now be rejected.

            Assignee:
            oleg.pudeyev@mongodb.com Oleg Pudeyev (Inactive)
            Reporter:
            oleg.pudeyev@mongodb.com Oleg Pudeyev (Inactive)