[SERVER-24007] Server can return invalid UTF8 for error messages due to truncation in the middle of a code point Created: 02/May/16 Updated: 27/Dec/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sunook Choi | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | query-44-grooming |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: |
ubuntu 14.04 / AWS EC2 |
| Issue Links: |
| Assigned Teams: | Query Execution |
| Operating System: | ALL |
| Sprint: | Platforms 15 (06/03/16) |
| Participants: |
| Case: | (copied to CRM) |
| Description |
|
This occurs on a collection that has a unique index and contains Korean text. See below:
cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$ python mongo_test.py
As shown above, it reproduces 100% of the time when I run the insert twice. Here is my test code:
    from pymongo import MongoClient

    mongo = MongoClient('localhost', 27017)
    # `col` is the target collection; its definition was not included in the report.
    script = {
        'brand_name': u'\ub77c\uc628',
        'category0': u'\uc0dd\ud65c/\uac74\uac15',
        'category1': u'\uacf5\uad6c',
        'category2': u'\ubaa9\uacf5\uacf5\uad6c',
        'category3': u'\ub300\ud328',
        'entity': [],
        'price': 9300,
        'title': u'\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\uba74 \ub2e4\ub4ec\uae30',
    }
    result = col.insert_one(script)
If you need more information or have a solution for this issue, please reply. Thanks a lot. |
| Comments |
| Comment by Bruce Lucas (Inactive) [ 18/Jun/18 ] |
|
I raised the priority of this ticket to P3 because the invalid UTF-8 can cause client code that tries to parse the BSON to fail. |
| Comment by Sunook Choi [ 03/Jun/16 ] |
|
Thanks a lot! |
| Comment by Bernie Hackett [ 01/Jun/16 ] |
|
Hi cswcsy, we've committed a workaround to PyMongo master, slated for the next PyMongo release, 3.3. See |
| Comment by Bernie Hackett [ 03/May/16 ] |
|
Unfortunately, CodecOptions isn't applied to server write responses in the bulk write API. It also isn't applied to non-bulk write responses when using the legacy write operations (MongoDB 2.4). I've opened |
| Comment by Sunook Choi [ 03/May/16 ] |
|
@Bernie Hackett |
| Comment by Sunook Choi [ 03/May/16 ] |
|
Thanks for the workaround! |
| Comment by Bernie Hackett [ 02/May/16 ] |
|
You can work around this in PyMongo by changing the error handler for Unicode decode errors:
Note that '\ud3c9\uba74' is being replaced with '\ud3c9\ufffd...' ('\ufffd' being the Unicode replacement character). If we use the 'ignore' directive instead, we get '\ud3c9...'.
I'm guessing this is just an unfortunate choice of byte count by the server when truncating the key value. The server appears to be cutting the string in the middle of a code point. This can probably be fixed by counting characters, rather than bytes, when deciding where to truncate the string. |
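The behaviour described in this comment can be reproduced with the standard library alone. The following is a minimal Python 3 sketch, not the server's code or PyMongo's implementation: the 4-byte limit and the helper name `truncate_utf8` are assumptions made for illustration. It shows the 'replace' and 'ignore' handlers on a string cut mid code point, and one possible way to truncate on a byte budget without splitting a code point (an alternative to counting characters).

```python
# Python 3 sketch. The 4-byte limit is an illustrative assumption, not the
# server's real truncation length.
tail = '\ud3c9\uba74'            # two Hangul syllables from the reported title
encoded = tail.encode('utf-8')   # 3 bytes per syllable, 6 bytes in total
truncated = encoded[:4]          # naive byte cut, inside the second syllable

# The two error handlers mentioned in the comment:
replaced = truncated.decode('utf-8', errors='replace')  # partial syllable becomes U+FFFD
ignored = truncated.decode('utf-8', errors='ignore')    # partial syllable is dropped


def truncate_utf8(data, max_bytes):
    """Hypothetical helper: cut UTF-8 bytes to at most max_bytes
    without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first excluded byte
    # is one, the cut is inside a code point, so step back to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


safe = truncate_utf8(encoded, 4)  # keeps only the first, complete syllable
```

With a boundary-aware cut like this, the truncated value stays valid UTF-8, so a strict decoder on the client side would not fail on it.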
| Comment by Bernie Hackett [ 02/May/16 ] |
|
I've tested this back to MongoDB 2.4. It appears this issue has always existed. My guess is the server is creating mojibake for the duplicate key error message. PyMongo can query for and display the document without issue:
|
| Comment by Sunook Choi [ 02/May/16 ] |
|
Thanks for the reply! Anyway, I hope this issue gets analyzed too! |
| Comment by Bernie Hackett [ 02/May/16 ] |
|
This is very strange. The problem is that MongoDB is returning a duplicate key error because a document matching the unique index already exists. The message that MongoDB returns includes the value that caused the error, but the server seems to have encoded it incorrectly, so Python can't decode it as UTF-8. In the server logs we have:
This seems like it must be a bug in the server, but I'll have to do some research. Thanks for reporting this! |
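To illustrate why a client fails on such a message, here is a small Python 3 sketch using only the standard library. It does not reproduce the server's actual bytes: the sample string is the end of the reported title, and the cut position and the helper name `first_invalid_offset` are assumptions chosen for the example.

```python
def first_invalid_offset(data):
    """Hypothetical helper: return the byte offset where strict UTF-8
    decoding fails, or None if the bytes are valid UTF-8."""
    try:
        data.decode('utf-8')  # 'strict' is the default error handler
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Two Hangul syllables from the reported title, 3 bytes each in UTF-8.
complete = '\ud3c9\uba74'.encode('utf-8')
# Assumed cut after 4 bytes, which leaves a lone lead byte at the end.
cut_mid_code_point = complete[:4]

ok = first_invalid_offset(complete)
bad = first_invalid_offset(cut_mid_code_point)
```

Here `ok` is `None`, while `bad` points at the dangling lead byte, which is the kind of input a strict decoder rejects.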
| Comment by Sunook Choi [ 02/May/16 ] |
|
+
|