[SERVER-29466] Cannot query over fields containing invalid UTF-8 Created: 06/Jun/17 Updated: 06/Dec/22 Resolved: 14/Sep/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bastian Suter | Assignee: | Backlog - Query Team (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Query
|
||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Query 2017-10-02 | ||||||||
| Participants: | |||||||||
| Description |
|
A string search or comparison query like the following fails to return any documents that contains special (unprintable?) characters:
Example document:
This causes me to be unable to directly find such documents in my database, which is very annoying. If you wonder why, from time to time such special characters can occur in user-generated data in my DB. This is not a bug and by design / allowed. The original document is saved using the C driver - if you copy and paste it from here to insert into a database the bug will not show as the unprintable characters seem to become UTF-8 character which are fine (e.g. there are no issues when searching a string with just Chinese in it). |
| Comments |
| Comment by Ian Whalen (Inactive) [ 14/Sep/17 ] | |||||
|
Closing as Won't Fix given Dave's above comment and that we don't intend to pursue an upstream fix for invalid UTF-8 and pcrecpp. | |||||
| Comment by David Storch [ 12/Sep/17 ] | |||||
|
Hi bastian, Apologies for the delay in our response. I looked into this and have some information to report. I think there are two issues at play here. The first is that our standard BSON validation routine, which is invoked whenever the server ingests BSON off the wire, does not validate UTF-8 strings. We have an open ticket tracking the work to add such UTF-8 validation: The second issue has to do with the behavior of regular expression matching. The server code includes the PCRE library as its regular expression engine, and interfaces with this library through a C++ wrapper: see pcrecpp.h. As far as I can tell, we enable UTF-8 when we build PCRE, and the expected behavior is that an error will be thrown when either the subject string or the regex pattern consists of invalid UTF-8. However, the C++ wrapper appears to discard this error when we call pcrecpp::RE::PartialMatch(). As a result, the server always fails to match when applying a regex to an invalid UTF-8 string. We could improve this behavior in one of two ways:
Both of these changes appear to require upstream changes in the PCRE library. While the former seems fairly straightforward, I suspect it is unlikely that we would pursue the latter. I'm copying mark.benvenuto, a MongoDB engineer who knows more about our PCRE integration, in case he has anything to add. Best, | |||||
| Comment by Ian Whalen (Inactive) [ 25/Aug/17 ] | |||||
|
assigning to dave to investigate for triage. | |||||
| Comment by Ian Whalen (Inactive) [ 03/Aug/17 ] | |||||
|
david.storch can you check on where and how we do UTF-8 validation? | |||||
| Comment by Mark Agarunov [ 19/Jul/17 ] | |||||
|
Hello bastian, Thank you for the information. I've now been able to reproduce this behavior using the C driver. I've set the fixVersion to "Needs Triage" for this new feature to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen. Thanks, | |||||
| Comment by Bastian Suter [ 03/Jul/17 ] | |||||
|
Sorry for the delay. I'm afraid I cannot upload my database for you and unfortunately copying the entire collection also solves the issue, so I'm afraid I can only suggest what I mentioned above: Try to insert a document containing a string such as "substring \xB7\x73\xBC\x57\xB8\xEA\xAE\xC6\xA7\xA8 substring" into a collection using the C driver (mongoc_collection_insert) and then try to find this document using the search query targeting "substring". You should be able to reproduce it 100% like this. | |||||
| Comment by Mark Agarunov [ 22/Jun/17 ] | |||||
|
Hello bastian, We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide:
Thanks, | |||||
| Comment by Mark Agarunov [ 09/Jun/17 ] | |||||
|
Hello bastian, Thank you for the quick response. I would like to take a deeper look at this behavior so I'd like to request a couple things. Could you please provide:
I've generated a secure upload portal so that you can send us these files privately. Thanks, | |||||
| Comment by Bastian Suter [ 09/Jun/17 ] | |||||
|
I did some research using raw file logs for the string in question and found the following: A string containing the bytes "B7 73 BC 57 B8 EA AE C6 A7 A8" was saved using the C driver. The problem with some of those bytes (e.g. B7) is that they are completely invalid in terms of UTF-8. When showing that document in my GUI, it shows as: "substring �s�W��Ƨ�" - the latter part represents these bytes. What happens is that the GUI or web browser simply shows these invalid bytes as the UTF-8 replacement character "�". Therefore, when copying this string from the browser or re-saving the printed document (I tried that), these invalid bytes will turn into valid UTF-8 characters and hence the string can suddenly be found using the aforementioned query. That also means that you can't reproduce the issue using any sort of UI (including console) or by copy&pasting but only using actual code where you can specify such bytes. So it's definitely an input issue when saving the document on my side, i.e. a string should never contain such invalid characters which the C driver seems to successfully save. I will fix that. The question remains if you consider it a bug that the regex string search fails because of these invalid characters in the string/document? I assume that the regex code fails somewhere because of these chars and hence this should be considered a "feature", i.e. no fix coming? Thanks again. | |||||
| Comment by Mark Agarunov [ 09/Jun/17 ] | |||||
|
Hello bastian, Thank you for the report. Unfortunately I haven't been have to reproduce the behavior you've described. After inserting the document from your example, the document is returned with a find() query:
If you are still seeing this issue, please append .explain(true) to the find() query and provide the output, for example:
Thanks, | |||||
| Comment by Bastian Suter [ 06/Jun/17 ] | |||||
|
Just realized: Please ignore the comparison part as that query naturally only contains printable/UTF-8 characters, hence it will never be equal to the actual string in the document which contains unprintable chars. The critical issue here is that the search query is not working. |