Python Driver / PYTHON-3048

BSON C extensions allow encoding a Regex with invalid utf-8

    • Type: Bug
    • Resolution: Fixed
    • Priority: Trivial - P5
    • Fix Version/s: 4.2
    • Affects Version/s: None
    • Component/s: None
    • Minor Change

      PyMongo allows encoding invalid UTF-8 via a Regex instance. Decoding the resulting document then fails unless unicode_decode_error_handler is overridden:

      >>> has_c()
      False
      >>> b'\xed\xbc\xad'.decode('utf-8')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
      >>> from bson import encode, decode
      >>> from bson.regex import Regex
      >>> from bson.codec_options import CodecOptions
      >>> data = encode({'a':Regex(b'\xed\xbc\xad','')})
      >>> data
      b'\r\x00\x00\x00\x0ba\x00\xed\xbc\xad\x00\x00\x00'
      >>> decode(data)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 903, in decode
          return _bson_to_dict(data, codec_options)
      bson.errors.InvalidBSON: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
      >>> decode(data, codec_options=CodecOptions(unicode_decode_error_handler='replace'))
      {'a': Regex('���', 0)}
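
      For reference, the encoded bytes above can be reproduced by hand. The following is a minimal sketch using only the standard library, not PyMongo's encoder; encode_regex_document is a hypothetical helper name. It copies the pattern bytes verbatim with no UTF-8 check, which is how the invalid sequence ends up in the document:

```python
import struct


def encode_regex_document(name: bytes, pattern: bytes, flags: bytes = b"") -> bytes:
    """Hypothetical helper: build a one-element BSON document holding a regex.

    No UTF-8 validation is performed; the pattern bytes are copied as-is.
    """
    # Element: type byte 0x0B, then the key, the pattern, and the flags,
    # each written as a NUL-terminated cstring.
    element = b"\x0b" + name + b"\x00" + pattern + b"\x00" + flags + b"\x00"
    # Document: little-endian int32 total length (counting the length field
    # itself and the trailing NUL), the elements, then a terminating NUL.
    total = 4 + len(element) + 1
    return struct.pack("<i", total) + element + b"\x00"


data = encode_regex_document(b"a", b"\xed\xbc\xad")
print(data)
```

      The 13-byte result matches the output shown above: a length prefix of 13 (b'\r\x00\x00\x00'), the regex element, and the document terminator.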
      

      The same case without the C extensions does properly raise an error:

      >>> encode({'a':Regex(b'\xed\xbc\xad')})
      Traceback (most recent call last):
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 479, in _make_c_string_check
          _utf_8_decode(string, None, True)
      UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 871, in encode
          return _dict_to_bson(document, check_keys, codec_options)
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 808, in _dict_to_bson
          elements.append(_element_to_bson(key, value,
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 794, in _element_to_bson
          return _name_value_to_bson(name, value, check_keys, opts)
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 736, in _name_value_to_bson
          return _ENCODERS[type(value)](name, value, check_keys, opts)
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 610, in _encode_regex
          return b"\x0B" + name + _make_c_string_check(value.pattern) + b"\x00"
        File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 482, in _make_c_string_check
          raise InvalidStringData("strings in documents must be valid "
      bson.errors.InvalidStringData: strings in documents must be valid UTF-8: b'\xed\xbc\xad'
      

      We should fix the C extensions.
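
      The check that the pure-Python path applies (visible in the traceback as _make_c_string_check) can be sketched roughly as below. This is an illustration under assumptions, not the actual PyMongo source or the eventual C fix: make_cstring_checked is a hypothetical name, and InvalidStringData here is a local stand-in for bson.errors.InvalidStringData.

```python
class InvalidStringData(ValueError):
    """Local stand-in for bson.errors.InvalidStringData (assumption)."""


def make_cstring_checked(value) -> bytes:
    """Hypothetical helper: return value as a NUL-terminated BSON cstring,
    rejecting embedded NUL bytes and invalid UTF-8."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    if b"\x00" in value:
        raise ValueError("cstrings must not contain a NUL byte")
    try:
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        value.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidStringData(
            f"strings in documents must be valid UTF-8: {value!r}"
        ) from exc
    return value + b"\x00"


print(make_cstring_checked("a.*b"))
try:
    make_cstring_checked(b"\xed\xbc\xad")
except InvalidStringData as exc:
    print("rejected:", exc)
```

      Applying an equivalent check to the Regex pattern in the C encoder would make both code paths reject the bytes from this report.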

      Edit: I had a copy-paste error in my PyPy example. PyPy works the same as CPython without C extensions.

            Assignee:
            ben.warner@mongodb.com Ben Warner (Inactive)
            Reporter:
            shane.harvey@mongodb.com Shane Harvey