[JAVA-1662] handle control character in Unicode format Created: 25/Feb/15  Updated: 09/Mar/17  Resolved: 09/Mar/17

Status: Closed
Project: Java Driver
Component/s: JSON
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: sandip Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates JAVA-1642 Mongo Connector does not extract vali... Closed

 Description   

Hi,

I have a document like this:

{ "_id" : ObjectId("54874f34062dfda18bcb47f5"), "a" : 2, "b" : 1, "c" : 1, "d" : "\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001", "e" :2}

From Mongo SHELL:

mongos> db.ascii.insert({ "_id" : ObjectId("54874f34062dfda18bcb47f5"), "a" : 2, "b" : 1, "c" : 1, "d" : "\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001", "e" :2})
WriteResult({ "nInserted" : 1 })
mongos>
mongos> db.ascii.find()
{ "_id" : ObjectId("54874f34062dfda18bcb47f5"), "a" : 2, "b" : 1, "c" : 1, "d" : "\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001", "e" : 2 }

With the Java driver, the same document is not printed correctly:

DBCursor cursor = table.find();

while (cursor.hasNext()) {
    DBObject tmp = cursor.next();
    System.out.println(tmp);
}

The output is:

{ "_id" : { "$oid" : "54874f34062dfda18bcb47f5"} , "a" : 2.0 , "b" : 1.0 , "c" : 1.0 , "d" : "" , "e" : 2.0}

The value of d is printed as an empty string.

I found that BasicDBObject.toString calls com.mongodb.util.JSON.serialize() to
serialize the object to a JSON string. To serialize a String value, the Java
driver uses the following method, com.mongodb.util.JSON.string:

   static void string( StringBuilder a , String s ){
        a.append("\"");
        for(int i = 0; i < s.length(); ++i){
            char c = s.charAt(i);
            if (c == '\\')
                a.append("\\\\");
            else if(c == '"')
                a.append("\\\"");
            else if(c == '\n')
                a.append("\\n");
            else if(c == '\r')
                a.append("\\r");
            else if(c == '\t')
                a.append("\\t");
            else if(c == '\b')
                a.append("\\b");
            else if ( c < 32 )
                continue;
            else
                a.append(c);
        }
        a.append("\"");
    }

From the lines:

 else if ( c < 32 )
    continue;

this method skips characters 0-31 instead of writing them as Unicode escapes.
Every control character in the range \u0000-\u001f other than \n, \r, \t and \b
is therefore dropped when a BasicDBObject is serialized to a string.

How should this kind of Unicode data be handled?
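
One possible answer is to write the remaining control characters as escapes rather than skipping them. The following is a minimal, self-contained sketch of that idea, not the driver's actual code: the class name ControlCharEscaper and the method name quote are hypothetical, and the extra '\f' case is an assumption added for completeness.

```java
final class ControlCharEscaper {

    // Quotes s as a JSON string, keeping control characters as escapes
    // instead of silently dropping them.
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                case '\b': out.append("\\b"); break;
                case '\f': out.append("\\f"); break;
                default:
                    if (c < 32) {
                        // Any other control character: write a six-character
                        // escape (backslash, 'u', four hex digits).
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // The middle character is the control character U+0001.
        System.out.println(quote("a\u0001b"));
    }
}
```

With this sketch, the control character in the example is written out as the six characters \u0001, so a value such as d in the document above would survive serialization.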



 Comments   
Comment by Jeffrey Yemin [ 09/Mar/17 ]

This is handled in the 3.x driver by the JsonWriter class, which BasicDBObject#toJson uses. This code, for example:

        BasicDBObject dbObject = new BasicDBObject("d", "\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001");
        System.out.println(dbObject.toJson());

outputs:

{ "d" : "\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001\u0001" }

So prefer the toJson method to the toString method, and the control characters will be output correctly.

Comment by sandip [ 26/Feb/15 ]

Yes, thank you.

Comment by Jeffrey Yemin [ 26/Feb/15 ]

This issue looks identical to JAVA-1642. Do you agree?

Generated at Thu Feb 08 08:55:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.