[SERVER-29466] Cannot query over fields containing invalid UTF-8 Created: 06/Jun/17  Updated: 06/Dec/22  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bastian Suter Assignee: Backlog - Query Team (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-12204 Buffer::readUTF8String in bson_valida... Closed
Assigned Teams:
Query
Operating System: ALL
Sprint: Query 2017-10-02
Participants:

 Description   

A string search or comparison query like the following fails to return documents that contain special (unprintable?) characters:

Comparison: { "string": "substring ���콻�� substring" } 
Search: { "string": /.*substring.*/i } 

Example document:

{ 
    "_id" : ObjectId("593708460121722b9463f7a1"), 
    "string" : "substring ���콻�� substring"
}

As a result, I am unable to directly find such documents in my database, which is very annoying. In case you wonder why: such special characters can occur from time to time in user-generated data in my DB. This is by design and allowed, not a bug.

The original document was saved using the C driver - if you copy and paste it from here to insert into a database, the bug will not show, as the unprintable characters seem to become valid UTF-8 characters, which are fine (e.g. there are no issues when searching a string that contains only Chinese text).



 Comments   
Comment by Ian Whalen (Inactive) [ 14/Sep/17 ]

Closing as Won't Fix given Dave's comment above and the fact that we don't intend to pursue an upstream fix for invalid UTF-8 and pcrecpp.

Comment by David Storch [ 12/Sep/17 ]

Hi bastian,

Apologies for the delay in our response. I looked into this and have some information to report. I think there are two issues at play here. The first is that our standard BSON validation routine, which is invoked whenever the server ingests BSON off the wire, does not validate UTF-8 strings. We have an open ticket tracking the work to add such UTF-8 validation: SERVER-12204.

The second issue has to do with the behavior of regular expression matching. The server code includes the PCRE library as its regular expression engine, and interfaces with this library through a C++ wrapper: see pcrecpp.h. As far as I can tell, we enable UTF-8 when we build PCRE, and the expected behavior is that an error will be thrown when either the subject string or the regex pattern consists of invalid UTF-8. However, the C++ wrapper appears to discard this error when we call pcrecpp::RE::PartialMatch(). As a result, the server always fails to match when applying a regex to an invalid UTF-8 string. We could improve this behavior in one of two ways:

  1. Return an error when attempting to match invalid UTF-8.
  2. Implement the ability for partial matching to work, if the invalid portion of the string is not involved in the match.

Both of these changes appear to require upstream changes in the PCRE library. While the former seems fairly straightforward, I suspect it is unlikely that we would pursue the latter.
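For illustration only (this is not the server's actual C++/pcrecpp code), the first option - returning an explicit error instead of silently failing to match - can be sketched in Python, where strict UTF-8 decoding serves as the validity check:

```python
import re

def match_with_utf8_check(pattern: str, subject: bytes):
    """Sketch of option 1: reject invalid UTF-8 with an explicit error
    instead of silently returning "no match"."""
    try:
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        text = subject.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"subject is not valid UTF-8: {exc}") from exc
    return re.search(pattern, text, re.IGNORECASE)

# A valid subject matches as usual:
assert match_with_utf8_check(r".*substring.*", b"substring abc substring")

# An invalid subject now raises instead of silently not matching:
try:
    match_with_utf8_check(r".*substring.*", b"substring \xb7\x73 substring")
except ValueError:
    pass  # the proposed error behavior
```

The key difference from the behavior described above is that the caller learns *why* the document was not matched, rather than the error being discarded inside the wrapper.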

I'm copying mark.benvenuto, a MongoDB engineer who knows more about our PCRE integration, in case he has anything to add.

Best,
Dave

Comment by Ian Whalen (Inactive) [ 25/Aug/17 ]

assigning to dave to investigate for triage.

Comment by Ian Whalen (Inactive) [ 03/Aug/17 ]

david.storch can you check on where and how we do UTF-8 validation?

Comment by Mark Agarunov [ 19/Jul/17 ]

Hello bastian,

Thank you for the information. I've now been able to reproduce this behavior using the C driver. I've set the fixVersion to "Needs Triage" for this new feature to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen.

Thanks,
Mark

Comment by Bastian Suter [ 03/Jul/17 ]

Sorry for the delay.

I'm afraid I cannot upload my database for you, and unfortunately copying the entire collection also makes the issue disappear, so I can only suggest what I mentioned above: try to insert a document containing a string such as "substring \xB7\x73\xBC\x57\xB8\xEA\xAE\xC6\xA7\xA8 substring" into a collection using the C driver (mongoc_collection_insert), then try to find this document using the search query targeting "substring". You should be able to reproduce it 100% this way.
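As an aside (not part of the original comment), the reason this reproduction only works from code can be shown with a minimal Python check: the byte sequence above is not valid UTF-8, so no shell or GUI input path will produce it verbatim. (The bytes look like text in a legacy code page such as Big5, though the ticket does not say so.)

```python
# The byte sequence from the reproduction steps above, used here for
# illustration in Python rather than via the C driver.
raw = b"substring \xB7\x73\xBC\x57\xB8\xEA\xAE\xC6\xA7\xA8 substring"

try:
    raw.decode("utf-8")  # strict decoding validates UTF-8
    valid = True
except UnicodeDecodeError:
    valid = False

# 0xB7 is a UTF-8 continuation byte with no lead byte, so the
# sequence cannot be decoded:
print(valid)  # → False
```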

Comment by Mark Agarunov [ 22/Jun/17 ]

Hello bastian,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide:

  • The complete log files from mongod
  • If possible, the database that contains the offending characters.

Thanks,
Mark

Comment by Mark Agarunov [ 09/Jun/17 ]

Hello bastian,

Thank you for the quick response. I would like to take a deeper look at this behavior so I'd like to request a couple things. Could you please provide:

  • The complete log files from mongod
  • If possible, the database that contains the offending characters.

I've generated a secure upload portal so that you can send us these files privately.

Thanks,
Mark

Comment by Bastian Suter [ 09/Jun/17 ]

I did some research using raw file logs for the string in question and found the following:

A string containing the bytes "B7 73 BC 57 B8 EA AE C6 A7 A8" was saved using the C driver. The problem with some of those bytes (e.g. B7) is that they are completely invalid in terms of UTF-8. When showing that document in my GUI, it appears as "substring �s�W��Ƨ�" - the latter part represents these bytes. What happens is that the GUI or web browser simply renders these invalid bytes as the UTF-8 replacement character "�". Therefore, when copying this string from the browser or re-saving the printed document (I tried that), the invalid bytes turn into valid UTF-8 characters, and the string can suddenly be found using the aforementioned query. That also means that you cannot reproduce the issue through any sort of UI (including the console) or by copy-and-pasting, but only with actual code where you can specify such bytes.

So it's definitely an input issue on my side when saving the document, i.e. a string should never contain such invalid characters, which the C driver nevertheless saves successfully. I will fix that. The question remains whether you consider it a bug that the regex string search fails because of these invalid characters in the string/document. I assume the regex code fails somewhere because of these chars, and hence this should be considered a "feature", i.e. no fix coming?

Thanks again.

Comment by Mark Agarunov [ 09/Jun/17 ]

Hello bastian,

Thank you for the report. Unfortunately I haven't been able to reproduce the behavior you've described. After inserting the document from your example, the document is returned by a find() query:

> db.test.insert({"string" : "substring ���콻�� substring"})
WriteResult({ "nInserted" : 1 })
> db.test.find({"string": /.*substring.*/i})
{ "_id" : ObjectId("593ae9c39c8807f43fdd9180"), "string" : "substring ���콻�� substring" }

If you are still seeing this issue, please append .explain(true) to the find() query and provide the output, for example:

 db.test.find({"string": /.*substring.*/i}).explain(true)

Thanks,
Mark

Comment by Bastian Suter [ 06/Jun/17 ]

Just realized: Please ignore the comparison part, as that query naturally contains only printable/valid UTF-8 characters and hence will never be equal to the actual string in the document, which contains unprintable chars. The critical issue here is that the search query is not working.

Generated at Thu Feb 08 04:20:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.