Mongo Connector does not extract valid control characters from 0x00 to 0x1F (issue against T2Mongo Connector)


    • Type: Task
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: JSON

      Teradata and MongoDB are in the process of releasing a T2Mongo connector. Teradata ran into the issue below and needs help with diagnosis and a possible solution/workaround.

      Description:

      We are trying to import some control characters. The result is correct when it is returned as a BasicDBObject. However, we need to cast it to a Java String, in which case we use the BasicDBObject.toString method provided by the Mongo Java driver. This method calls com.mongodb.util.JSON.serialize() to serialize the object. I've looked at the source code and found that the following method is called to serialize string values:

         static void string( StringBuilder a , String s ){
             a.append("\"");
             for(int i = 0; i < s.length(); ++i){
                 char c = s.charAt(i);
                 if (c == '\\')
                     a.append("\\\\");
                 else if(c == '"')
                     a.append("\\\"");
                 else if(c == '\n')
                     a.append("\\n");
                 else if(c == '\r')
                     a.append("\\r");
                 else if(c == '\t')
                     a.append("\\t");
                 else if(c == '\b')
                     a.append("\\b");
                 else if ( c < 32 )
                     continue;
                 else
                     a.append(c);
             }
             a.append("\"");
         }
      

      From the line:

                  else if ( c < 32 )
                      continue;
      

      it skips characters 0-31 and does not escape them in the \uXXXX Unicode format, so control characters in the range \u0000-\u001F (other than \b, \t, \n, and \r) are silently dropped and we cannot extract them. I also realize that if I implement a simple JSONSerializer (casting the BasicDBObject to a Map and constructing the JSON document from its key/value pairs), the data type information is lost; that is, when that data is exported back to the Mongo side, Mongo can only recognize it as String type. Regarding this issue, could you help me determine whether this is a bug in the Mongo Java driver or intended behavior, and whether there is a workaround to keep those Unicode characters when serializing a BasicDBObject to a String?
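
      For illustration, here is a minimal sketch of a workaround serializer that writes control characters as \u00XX escapes (which is valid JSON) instead of skipping them. This is our own sketch, not the driver's code: the class name JsonStringEscaper, the main() harness, and the commented outputs are assumptions based on the behavior described above, and it assumes the legacy 2.x Java driver is on the classpath.

         import com.mongodb.BasicDBObject;

         public final class JsonStringEscaper {

             // Same structure as the driver's string(), except that the
             // "c < 32" branch emits a \u00XX escape instead of dropping
             // the character.
             static void string( StringBuilder a , String s ){
                 a.append("\"");
                 for(int i = 0; i < s.length(); ++i){
                     char c = s.charAt(i);
                     if (c == '\\')
                         a.append("\\\\");
                     else if(c == '"')
                         a.append("\\\"");
                     else if(c == '\n')
                         a.append("\\n");
                     else if(c == '\r')
                         a.append("\\r");
                     else if(c == '\t')
                         a.append("\\t");
                     else if(c == '\b')
                         a.append("\\b");
                     else if ( c < 32 )
                         a.append(String.format("\\u%04x", (int) c));
                     else
                         a.append(c);
                 }
                 a.append("\"");
             }

             public static void main(String[] args) {
                 BasicDBObject doc = new BasicDBObject("utf8string", "\u0001\u0001\u0001");
                 // Driver serialization: the control characters vanish.
                 System.out.println(doc);
                 // Workaround serialization: the characters are kept as escapes.
                 StringBuilder sb = new StringBuilder();
                 string(sb, "\u0001\u0001\u0001");
                 System.out.println(sb);   // "\u0001\u0001\u0001"
             }
         }

      Note that this only fixes the string escaping; as mentioned above, round-tripping non-string BSON types through a hand-rolled JSON serializer still loses the type information.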

      After I implemented a simple JSONSerializer, I noticed that the DBS side is unable to extract a JSON document column if the column contains Unicode characters in the '\uXXXX' format, such as \u0001. For example:

      Data in utf8 collection :

      { "_id" : ObjectId("54d49a22c6ee70b789a21d55"), "utf8string" : "\u0001\u0001\u0001" }

      It is ok to do:

      select * from Foreign Table(@BEGIN_PASS_THRU test.utf8.find()@END_PASS_THRU)@Mongo as T;

      {"_id":"54d49a3dc6ee70b789a21d56","utf8string":" "}

      But if I run
      select MongoData from Foreign Table(@BEGIN_PASS_THRU test.utf8.find()@END_PASS_THRU)@Mongo as T;
      or
      select MongoData.utf8string from Foreign Table(@BEGIN_PASS_THRU test.utf8.find()@END_PASS_THRU)@Mongo as T;

      it errors out with:

          • Failure 7548 Invalid JSON data: Expected something like whitespace or '{' or '}' or '[' or ']' or ':' or ',' or '"' or '\' between '"' and '\0001' at character position 48. Make sure data was not truncated.
            Statement# 1, Info =0

    • Assignee: Unassigned
    • Reporter: Muthu Chinnasamy (Inactive)
    • Votes: 0
    • Watchers: 1
