[SERVER-3538] UTF8 null character \u0000 in the middle of a string is not handled correctly Created: 05/Aug/11  Updated: 05/Jan/14  Resolved: 05/Aug/11

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 1.8.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charles-Henri d'Adhémar Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SUSE Linux Enterprise Server x86_64 10.3


Issue Links:
Duplicate
duplicates SERVER-1300 use memcmp, not strcmp for comparing ... Closed
Related
Operating System: ALL
Participants:

 Description   

Hello,

The valid UTF-8 character \u0000 is not handled properly: the string is cut at this character. Mongo is probably interpreting it as a string-terminating character.
Example:

MongoDB shell version: 1.8.0
connecting to: test
> db.test.save({'text': 'foo\u0000bar'})
> db.test.findOne()
{ "_id" : ObjectId("4e3bbbaa4e496a38200a6f81"), "text" : "foo" }
> db.test.findOne({text: /foo/})
{ "_id" : ObjectId("4e3bbbaa4e496a38200a6f81"), "text" : "foo" }
> db.test.findOne({text: /bar/})
null
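
The truncation above is consistent with the string being read as a NUL-terminated C string somewhere along the way. As a rough illustration only (this is not MongoDB code, and the variable names are made up for the example), Python's ctypes module can show the difference between reading a buffer up to the first NUL byte and reading its full length, which is also the strcmp vs. memcmp distinction named in the SERVER-1300 title:

```python
import ctypes

# Two strings that differ only after the embedded U+0000.
first = ctypes.create_string_buffer("foo\u0000bar".encode("utf-8"))
second = ctypes.create_string_buffer("foo\u0000baz".encode("utf-8"))

# .value reads the buffer the way C string functions do: up to the first NUL byte.
as_c_string = first.value
# .raw returns every byte in the buffer, including the terminator ctypes appends.
all_bytes = first.raw

print(as_c_string)
print(all_bytes)

# strcmp-style comparison stops at the first NUL, so the two strings look equal.
strcmp_like_equal = first.value == second.value
# memcmp-style comparison covers the full length, so the difference is seen.
memcmp_like_equal = first.raw == second.raw
print(strcmp_like_equal, memcmp_like_equal)
```

Any code path that uses the NUL-terminated view would see only "foo", which matches the shell output above.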

We use Mongo to log errors from various servers. We have no control over the characters in the incoming strings, and so far we have no workaround for this issue.
Thank you very much in advance for your feedback on this issue.

Cheers,
CH.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 05/Aug/11 ]

See SERVER-1300

Comment by Charles-Henri d'Adhémar [ 05/Aug/11 ]

Here is some more information:

In Python this case is handled correctly:

In [1]: import re

In [2]: text = u'foo\u0000bar'

In [3]: re.search('foo', text)
Out[3]: <_sre.SRE_Match object at 0x2ba230d7bcc8>

In [4]: re.search('bar', text)
Out[4]: <_sre.SRE_Match object at 0x2ba230d7bd30>

In production we use the pymongo driver: the string 'foo\u0000bar' is correctly saved in the DB and correctly retrieved by either the pymongo API or the interactive JavaScript shell. However, a regex search for words after the '\u0000' character fails in both pymongo and the interactive shell.
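
One possible explanation for "stored correctly but not matched" is sketched below. The BSON specification encodes a string value as an int32 byte length, the UTF-8 bytes, and a trailing NUL, so a reader that honors the length prefix keeps the whole string, while a reader that scans for the first NUL sees only "foo". The helper names here are hypothetical and the code only mimics the wire layout; it is not the server's or the driver's implementation, and whether the regex path actually works this way is an assumption.

```python
import re
import struct


def encode_bson_string(text: str) -> bytes:
    """Encode a BSON string value: int32 length (counting the trailing NUL), bytes, NUL."""
    data = text.encode("utf-8")
    return struct.pack("<i", len(data) + 1) + data + b"\x00"


def decode_by_length(buf: bytes) -> str:
    """Decode using the length prefix, so embedded NUL bytes are preserved."""
    (length,) = struct.unpack_from("<i", buf, 0)
    return buf[4:4 + length - 1].decode("utf-8")


def decode_as_c_string(buf: bytes) -> str:
    """Decode by scanning for the first NUL, as C string functions would."""
    payload = buf[4:]
    return payload[:payload.index(b"\x00")].decode("utf-8")


encoded = encode_bson_string("foo\u0000bar")
full_view = decode_by_length(encoded)
c_view = decode_as_c_string(encoded)

print(repr(full_view), bool(re.search("bar", full_view)))
print(repr(c_view), bool(re.search("bar", c_view)))
```

Under these assumptions, the length-based view matches /bar/ and the NUL-terminated view does not, which is the same pattern as described above.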

The issue might come from several places:

A PCRE lib issue?
UTF-8 vs. UTF-32, as used by Python on some Linux distros?

The issue is not as simple as "in UTF-8, \u0000 is the string-terminating character, so this is working as designed".

Do not hesitate to ask for more information.
Cheers,
CH

Generated at Thu Feb 08 03:03:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.