- Type: Bug
- Resolution: Fixed
- Priority: Trivial - P5
- Affects Version/s: None
- Component/s: None
- Minor Change
PyMongo allows inserting invalid UTF-8 via a Regex instance. Decoding the resulting document then fails unless unicode_decode_error_handler is overridden:
>>> has_c()
False
>>> b'\xed\xbc\xad'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
>>> from bson import encode, decode
>>> data = encode({'a':Regex(b'\xed\xbc\xad','')})
b'\r\x00\x00\x00\x0ba\x00\xed\xbc\xad\x00\x00\x00'
>>> decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 903, in decode
    return _bson_to_dict(data, codec_options)
bson.errors.InvalidBSON: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
>>> decode(data, codec_options=CodecOptions(unicode_decode_error_handler='replace'))
{'a': Regex('���', 0)}
The same case without the C extensions does properly raise an error:
>>> encode({'a':Regex(b'\xed\xbc\xad')})
Traceback (most recent call last):
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 479, in _make_c_string_check
    _utf_8_decode(string, None, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 871, in encode
    return _dict_to_bson(document, check_keys, codec_options)
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 808, in _dict_to_bson
    elements.append(_element_to_bson(key, value,
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 794, in _element_to_bson
    return _name_value_to_bson(name, value, check_keys, opts)
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 736, in _name_value_to_bson
    return _ENCODERS[type(value)](name, value, check_keys, opts)
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 610, in _encode_regex
    return b"\x0B" + name + _make_c_string_check(value.pattern) + b"\x00"
  File "/Users/shane/git/mongo-python-driver/bson/__init__.py", line 482, in _make_c_string_check
    raise InvalidStringData("strings in documents must be valid "
bson.errors.InvalidStringData: strings in documents must be valid UTF-8: b'\xed\xbc\xad'
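For reference, the kind of check the pure-Python path appears to perform on a regex pattern can be sketched with the standard library alone. This is a minimal illustration, not PyMongo's actual code: `make_c_string_check` and the local `InvalidStringData` class are hypothetical stand-ins named after the frames in the traceback above, and the NUL-byte check is an assumption based on BSON cstrings being NUL-terminated.

```python
class InvalidStringData(ValueError):
    """Hypothetical local stand-in for bson.errors.InvalidStringData."""


def make_c_string_check(value):
    """Return `value` as a NUL-terminated cstring, rejecting bad input.

    Bytes input must already be valid UTF-8; str input is encoded as UTF-8.
    """
    if isinstance(value, bytes):
        # Assumption: a cstring cannot carry an embedded NUL terminator.
        if b"\x00" in value:
            raise InvalidStringData("cstrings must not contain a NUL byte")
        try:
            # Strict decoding rejects sequences such as b'\xed\xbc\xad',
            # which would encode a UTF-16 surrogate (U+DF2D).
            value.decode("utf-8", "strict")
        except UnicodeDecodeError:
            raise InvalidStringData(
                "strings in documents must be valid UTF-8: %r" % (value,)
            ) from None
        return value + b"\x00"
    if "\x00" in value:
        raise InvalidStringData("cstrings must not contain a NUL character")
    return value.encode("utf-8") + b"\x00"


if __name__ == "__main__":
    print(make_c_string_check(b"caf\xc3\xa9"))  # valid UTF-8 is accepted
    try:
        make_c_string_check(b"\xed\xbc\xad")
    except InvalidStringData as exc:
        print("rejected:", exc)
```

Validating at encode time keeps the invalid bytes from ever reaching the server, so readers do not have to fall back on unicode_decode_error_handler to get the document back.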
We should fix the C extensions.
Edit: I had a copy-paste error in my PyPy example. PyPy works the same as CPython without C extensions.