  1. Spark Connector
  2. SPARK-43

Ensure that BSON types are preserved when round-tripping DataFrames.

    • Type: Improvement
    • Resolution: Done
    • Priority: Minor - P4
    • 0.2
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      The Spark Catalyst engine supports a relatively small number of data types. Currently, ObjectIds are cast to strings, so their type information is lost when the DataFrame is saved back to MongoDB.


      Was: Identify and wrap _id columns with ObjectId when writing the DataFrame

      When reading from MongoDB, the _id attribute is represented as a string in the DataFrame. If you then apply some transformations and write the result back to MongoDB, the _id attribute is written as a plain string. Would it be possible to detect whether a value is a valid ObjectId and wrap it when storing the DataFrame back into MongoDB?

          row.schema.fields.zipWithIndex.foreach({
            case (field, i) =>
              val data = field.dataType match {
                case arrayField: ArrayType if !row.isNullAt(i) => arrayTypeToData(arrayField, row.getSeq(i))
                case subDocument: StructType if !row.isNullAt(i) => rowToDocument(row.getStruct(i))
                // Proposed change: wrap a non-null string _id in an ObjectId.
                // Note: new ObjectId(...) throws if the string is not valid ObjectId hex.
                case _ => if (field.name == "_id" && field.dataType.typeName == "string" && !row.isNullAt(i)) new ObjectId(row.getString(i)) else row.get(i)
              }
              document.append(field.name, data)
          })
      
      

      The change above would go in the rowToDocument function.

      A regex test could be added to make sure the value is a valid ObjectId. Alternatively, the StructField metadata could be used to mark the column as an ObjectId when the schema is inferred.
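
      To make the two options concrete, here is a minimal sketch of both. It is not the connector's implementation: the object name ObjectIdHints, the helper names, and the "bsonType" metadata key are all hypothetical and chosen only for illustration. It assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// Hypothetical helper object; none of these names come from the connector.
object ObjectIdHints {
  // An ObjectId is 12 bytes, written as exactly 24 hexadecimal characters.
  private val ObjectIdHex = "^[0-9a-fA-F]{24}$".r

  // Option 1: regex test on the value itself (null-safe).
  def looksLikeObjectId(value: String): Boolean =
    value != null && ObjectIdHex.pattern.matcher(value).matches()

  // Option 2: tag the column in the StructField metadata at schema-inference time.
  // "bsonType" is an assumed key name for this sketch.
  val BsonTypeKey = "bsonType"

  def objectIdField(name: String): StructField = {
    val metadata = new MetadataBuilder().putString(BsonTypeKey, "objectId").build()
    StructField(name, StringType, nullable = true, metadata)
  }

  // True only if the field carries the ObjectId tag written by objectIdField.
  def isTaggedObjectId(field: StructField): Boolean =
    field.metadata.contains(BsonTypeKey) &&
      field.metadata.getString(BsonTypeKey) == "objectId"
}
```

      In rowToDocument, the string would then be wrapped only when isTaggedObjectId(field) holds and looksLikeObjectId(row.getString(i)) passes. The regex alone also matches any 24-character hex string that was never an ObjectId, which is one reason the metadata tag may be the safer signal.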

            Assignee: Unassigned
            Reporter: Jan Scherbaum (lokm01)
            Votes: 0
            Watchers: 4
