[SERVER-24007] Server can return invalid UTF8 for error messages due to truncation in the middle of a code point Created: 02/May/16  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sunook Choi Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 1
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu 14.04 / AWS EC2


Issue Links:
Duplicate
is duplicated by CDRIVER-2453 Invalid bson returned in bulk operati... Closed
is duplicated by SERVER-55442 Server returns invalid utf-8 in dupli... Closed
Problem/Incident
causes RUBY-2560 EncodingError raised when server retu... Backlog
causes DRIVERS-1936 Drivers should have option to disable... Backlog
Related
related to PYTHON-1682 Unicode errors from server are improp... Closed
related to RUST-886 Use Lossy UTF8 Decoding when decoding... Closed
related to RUST-648 Decoding a a document with lossy utf8... Closed
is related to NODE-3627 Getting "Invalid UTF-8 string in BSON... Closed
is related to SERVER-26050 Unique key violation for index with a... Backlog
is related to DRIVERS-2008 Default to lossy/replacement behavior... Backlog
is related to PYTHON-1090 Use 'replace' error handler when deco... Closed
Assigned Teams:
Query Execution
Operating System: ALL
Sprint: Platforms 15 (06/03/16)
Participants:
Case:

 Description   

A collection has a unique index on a field that contains Korean text.
The driver raises an error when a document with a duplicate value is inserted.

See the output below:


cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$ python mongo_test.py
<pymongo.results.InsertOneResult object at 0x7fba9819e500>
cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$ python mongo_test.py
Traceback (most recent call last):
File "mongo_test.py", line 24, in <module>
result = col.insert_one(script)
File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 625, in insert_one
bypass_doc_val=bypass_document_validation),
File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 530, in _insert
check_keys, manipulate, write_concern, op_id, bypass_doc_val)
File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 512, in _insert_one
check_keys=check_keys)
File "/usr/local/lib/python2.7/dist-packages/pymongo/pool.py", line 218, in command
self._raise_connection_failure(error)
File "/usr/local/lib/python2.7/dist-packages/pymongo/pool.py", line 346, in _raise_connection_failure
raise error
bson.errors.InvalidBSON: 'utf8' codec can't decode byte 0xeb in position 230: invalid continuation byte
cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$


As shown above, the error reproduces 100% of the time when I insert the same document twice
and the unique index is on the Korean field.
It does not reproduce with other Korean content.

Here is my test code:


#-*- coding: utf-8 -*-
from pprint import pprint
from pymongo import ReplaceOne
from pymongo import InsertOne
import pymongo
from pymongo import MongoClient
from utils.mongomanager import MongoManager
from pymongo.errors import BulkWriteError

mongo = MongoClient('localhost', 27017)
db = mongo['bigdata']
col = db['test']

script = {'brand_name': u'\ub77c\uc628', 'category0': u'\uc0dd\ud65c/\uac74\uac15', 'category1': u'\uacf5\uad6c', 'category2': u'\ubaa9\uacf5\uacf5\uad6c', 'category3': u'\ub300\ud328', 'entity': [], 'price': 9300, 'title': u'\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\uba74 \ub2e4\ub4ec\uae30'}

result = col.insert_one(script)
pprint(result)


If you need more information, or if you have a solution for this issue, please reply.

Thanks a lot.



 Comments   
Comment by Bruce Lucas (Inactive) [ 18/Jun/18 ]

I raised the priority of this ticket to P3 because the invalid UTF-8 can cause client code that tries to parse the bson to fail.

Comment by Sunook Choi [ 03/Jun/16 ]

Thanks a lot!

Comment by Bernie Hackett [ 01/Jun/16 ]

Hi cswcsy, we've committed a workaround to PyMongo master, slated for the next PyMongo release, 3.3. See PYTHON-1090 for details.

Comment by Bernie Hackett [ 03/May/16 ]

Unfortunately, CodecOptions isn't applied to server write responses in the bulk write API. It also isn't applied for non-bulk write responses when using the legacy write operations (MongoDB 2.4). I've opened PYTHON-1090 to add a workaround. We'll likely just always use the replace error handler for server write responses in the future (but not query responses).

Comment by Sunook Choi [ 03/May/16 ]

@Bernie Hackett
When I use bulk_write in PyMongo, your workaround (setting the error handler to 'replace') does not work.
It does work when I use insert_one.
If you have time, could you check it?

#-*- coding: utf-8 -*-
from pprint import pprint
from pymongo import ReplaceOne
from pymongo import InsertOne
import pymongo
from pymongo import MongoClient
from utils.mongomanager import MongoManager
from pymongo.errors import BulkWriteError
from bson.codec_options import CodecOptions
 
mongo = MongoClient('localhost', 27017)
db = mongo['bigdata']
col = db['test'].with_options(codec_options=CodecOptions(unicode_decode_error_handler='replace'))
 
bulk = []
 
script ={'brand_name': u'\ub77c\uc628',
 'category0': u'\uc0dd\ud65c/\uac74\uac15',
 'category1': u'\uacf5\uad6c',
 'category2': u'\ubaa9\uacf5\uacf5\uad6c',
 'category3': u'\ub300\ud328',
 'entity': [],
 'price': 9300,
 'title': '구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평면 다듬기'}
 
bulk.append(InsertOne(script))
 
result = col.bulk_write(bulk, ordered=False, bypass_document_validation=True)
pprint(result)

Comment by Sunook Choi [ 03/May/16 ]

Thanks for the workaround!

Comment by Bernie Hackett [ 02/May/16 ]

You can work around this in PyMongo by changing the error handler for unicode decode errors:

>>> from bson.codec_options import CodecOptions
>>> coll = c.test.test.with_options(codec_options=CodecOptions(unicode_decode_error_handler='replace'))
>>> coll.insert_one(doc)
<pymongo.results.InsertOneResult object at 0x7f6e6b05fdc0>
>>> del doc['_id']
>>> try:
...     coll.insert_one(doc)
... except Exception as exc:
...     exc.details
... 
{u'index': 0, u'code': 11000, u'errmsg': u'E11000 duplicate key error collection: test.test index: title_1 dup key: { : "\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\ufffd..." }'}

Note that '\ud3c9\uba74' is being replaced with '\ud3c9\ufffd...' ('\ufffd' being the unicode replacement character). If instead we use the 'ignore' directive we get '\ud3c9...'
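
The difference between the handlers can be illustrated without a server. The following is a minimal Python 3 sketch. It assumes the cut fell one byte into the second syllable, which is consistent with the 0xeb byte reported in the original traceback but is not confirmed from the server code. The variable names are illustrative only.

```python
# The last two syllables of the key before truncation: U+D3C9 then U+BA74.
# Each one is 3 bytes in UTF-8 (ED 8F 89 and EB A9 B4).
tail = u'\ud3c9\uba74'.encode('utf-8')

# Assumption: only the first byte (0xEB) of the second syllable survives,
# followed by the "..." suffix seen in the error message.
cut = tail[:4] + b'...'

# The default 'strict' handler rejects the dangling lead byte.
strict_error = None
try:
    cut.decode('utf-8')
except UnicodeDecodeError as exc:
    strict_error = exc

# 'replace' substitutes U+FFFD for the bad byte; 'ignore' drops it.
replaced = cut.decode('utf-8', 'replace')
ignored = cut.decode('utf-8', 'ignore')

print(strict_error.reason)
print(repr(replaced))
print(repr(ignored))
```

Under that assumption, 'replace' keeps a visible marker where the byte was lost, while 'ignore' silently shortens the string.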

I'm guessing this is just an unfortunate choice by the server of byte count to truncate the key value. The server appears to be truncating part of a code point. This can probably be fixed by counting characters, rather than bytes, when deciding where to truncate the string.
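
As a rough illustration of truncating without splitting a code point, here is a minimal Python 3 sketch. It is not the server's implementation, and it uses a slightly different technique from counting characters: it keeps a byte limit but backs the cut up to the nearest code point boundary. The function name truncate_utf8 and the max_bytes parameter are hypothetical, and the input is assumed to be valid UTF-8.

```python
def truncate_utf8(data, max_bytes):
    """Cut UTF-8 bytes to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first dropped byte
    # is one of them, the cut is inside a code point, so step back until the
    # first dropped byte starts a new code point.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


key = u'\ud3c9\uba74'.encode('utf-8')  # two 3-byte syllables, 6 bytes total

# A 4-byte limit would land inside the second syllable; the cut moves back to 3.
safe = truncate_utf8(key, 4)
print(repr(safe.decode('utf-8')))
```

The result always decodes cleanly, at the cost of returning up to three bytes fewer than the limit.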

Comment by Bernie Hackett [ 02/May/16 ]

I've tested this back to MongoDB 2.4. It appears this issue has always existed. My guess is the server is creating mojibake for the duplicate key error message.

PyMongo can query for and display the document without issue:

>>> c.test.test.find_one()
{u'category1': u'\uacf5\uad6c', u'category0': u'\uc0dd\ud65c/\uac74\uac15', u'category3': u'\ub300\ud328', u'category2': u'\ubaa9\uacf5\uacf5\uad6c', u'title': u'\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\uba74 \ub2e4\ub4ec\uae30', u'price': 9300, u'brand_name': u'\ub77c\uc628', u'entity': [], u'_id': ObjectId('57279630fba52269ef009e0d')}

Comment by Sunook Choi [ 02/May/16 ]

Thanks for the reply!
I know a Korean title field is not a good choice for a unique index (for performance and other reasons),
so I need to find an alternative.

Anyway, I hope this issue gets analyzed too.
Thanks again.

Comment by Bernie Hackett [ 02/May/16 ]

This is very strange. MongoDB is returning a duplicate key error because a document matching the unique index already exists. The message that MongoDB returns includes the value that caused the error, but the server seems to have encoded it incorrectly, so Python can't decode it as UTF-8.

In the server logs we have:

2016-05-02T07:41:59.453-0700 D WRITE    [conn3]  Caught Assertion in query, continuing  :: caused by :: E11000 duplicate key error collection: test.test index: title_1 dup key: { : "구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평�..." }
2016-05-02T07:41:59.453-0700 I COMMAND  [conn3] command test.test command: insert { insert: "test", ordered: true, documents: [ { category1: "공구", category0: "생활/건강", category3: "대패", category2: "목공공구", title: "구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평�...", price: 9300, _id: ObjectId('57276737fa5bd81c3ca2f5d8'), brand_name: "라온", entity: [] } ] } ninserted:0 keyUpdates:0 writeConflicts:0 exception: E11000 duplicate key error collection: test.test index: title_1 dup key: { : "구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평�..." } code:11000 numYields:0 reslen:334 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_query 0ms

This seems like it must be a bug in the server, but I'll have to do some research. Thanks for reporting this!

Comment by Sunook Choi [ 02/May/16 ]

In addition, I used the following command to create the index:

db.test.createIndex({title:1},{unique:true})
