Type: Task
Resolution: Done
Priority: Minor - P4
Affects Version/s: 2.7
Component/s: None
Environment: All
Hi there,
I'm having trouble with a MongoDB collection that contains invalid UTF-8 strings. My Python code, which iterates over the collection with a cursor, throws a UnicodeDecodeError:
  File "build/bdist.macosx-10.9-x86_64/egg/mongo_connector/oplog_manager.py", line 354, in docs_to_dump
    for doc in cursor:
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymongo/cursor.py", line 904, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymongo/cursor.py", line 865, in _refresh
    limit, self.__id))
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymongo/cursor.py", line 800, in __send_message
    self.__uuid_subtype)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymongo/helpers.py", line 107, in _unpack_response
    as_class, tz_aware, uuid_subtype)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 11: invalid continuation byte
After a few minutes of investigation, it seems to be related to the "strict" error mode used for all UTF-8 decoding in the C layer of the BSON library (https://github.com/mongodb/mongo-python-driver/blob/master/bson/_cbsonmodule.c). Is there any chance "ignore" could be used instead? For example, the mongoexport tool skips the failing characters.
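To illustrate what I mean, here is a minimal sketch in plain Python that does not involve PyMongo at all. The sample bytes and the helper names (`RAW`, `decode_with`, `try_strict`) are made up for this example; it only shows how the standard "strict", "ignore" and "replace" error handlers treat a stray 0xe9 byte like the one in the traceback, not how the C extension is actually implemented.

```python
# Hypothetical sample: "café au lait" encoded as Latin-1, so the 0xe9 byte
# is not valid UTF-8 (it looks like a lead byte with no continuation bytes).
RAW = b"caf\xe9 au lait"


def decode_with(raw, errors):
    """Decode raw bytes as UTF-8 using the given codec error handler."""
    return raw.decode("utf-8", errors)


def try_strict(raw):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return decode_with(raw, "strict")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(try_strict(RAW))                   # strict: decoding fails
    print(repr(decode_with(RAW, "ignore")))  # the bad byte is dropped
    print(repr(decode_with(RAW, "replace"))) # the bad byte becomes U+FFFD
```

With "ignore" the invalid byte is silently dropped, which loses data without any trace; "replace" at least leaves a U+FFFD marker where the bad byte was. Either would let the cursor keep iterating instead of aborting on the first bad document.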
Please let me know what you think.