Spark Connector / SPARK-272

Saving a DataFrame with null values omits null fields


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.3, 2.3.3, 2.4.0, 2.4.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When saving a DataFrame to MongoDB with a nullable field, the field is omitted from any document where its value is null, but only when at least one other document has a non-null value for that field. If every value of the field is null, the nulls are saved as expected.

      Given the following simple program:
       

      import com.mongodb.spark._
      import com.mongodb.spark.config._
      import org.apache.spark.sql.SQLContext

      val sqlContext = SQLContext.getOrCreate(sc)

      // Load the source collection using the connection settings from the Spark conf
      val df = MongoSpark.load(sqlContext)

      println("DataFrame Schema")
      df.printSchema()
      df.show()

      // Write the DataFrame out to a separate collection
      val writeConfig = WriteConfig(Map(
        "uri" -> sys.env("MONGO_URI"),
        "database" -> "test",
        "collection" -> "dataframe_output",
        "mode" -> "overwrite"))
      MongoSpark.save(df, writeConfig)

      println("Done")
      System.exit(0)
      

      When run against a source collection with the following documents:

      test> db.sample_data.find()
      {
        "_id": ObjectId("5eded30f4e63423e35515711"),
        "a": 1,
        "b": 1,
        "c": null
      }
      {
        "_id": ObjectId("5edfc0914e63423e35515712"),
        "a": 2,
        "b": 2,
        "c": null
      }
      

      DataFrame schema and data:

      DataFrame Schema
      root
       |-- _id: struct (nullable = true)
       |    |-- oid: string (nullable = true)
       |-- a: double (nullable = true)
       |-- b: double (nullable = true)
       |-- c: null (nullable = true)
       
      +--------------------+---+---+----+
      |                 _id|  a|  b|   c|
      +--------------------+---+---+----+
      |[5eded30f4e63423e...|1.0|1.0|null|
      |[5edfc0914e63423e...|2.0|2.0|null|
      +--------------------+---+---+----+
      

      The output is as expected: the c field is present in both output documents with a value of null:

      test> db.dataframe_output.find()
      {
        "_id": ObjectId("5eded30f4e63423e35515711"),
        "a": 1,
        "b": 1,
        "c": null
      }
      {
        "_id": ObjectId("5edfc0914e63423e35515712"),
        "a": 2,
        "b": 2,
        "c": null
      }
      

      If we alter the source collection so that one of the documents contains a value for c:

      test> db.sample_data.find()
      {
        "_id": ObjectId("5eded30f4e63423e35515711"),
        "a": 1,
        "b": 1,
        "c": 1
      }
      {
        "_id": ObjectId("5edfc0914e63423e35515712"),
        "a": 2,
        "b": 2,
        "c": null
      }
      

      Note that the c field is now implicitly typed as double rather than null:

      DataFrame Schema
      root
       |-- _id: struct (nullable = true)
       |    |-- oid: string (nullable = true)
       |-- a: double (nullable = true)
       |-- b: double (nullable = true)
       |-- c: double (nullable = true)
       
      +--------------------+---+---+----+
      |                 _id|  a|  b|   c|
      +--------------------+---+---+----+
      |[5eded30f4e63423e...|1.0|1.0| 1.0|
      |[5edfc0914e63423e...|2.0|2.0|null|
      +--------------------+---+---+----+
      

      The resulting output omits the c field entirely from the document where its value is null:

      test> db.dataframe_output.find()
      {
        "_id": ObjectId("5eded30f4e63423e35515711"),
        "a": 1,
        "b": 1,
        "c": 1
      }
      {
        "_id": ObjectId("5edfc0914e63423e35515712"),
        "a": 2,
        "b": 2
      }
      
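      (For reference, the double typing above comes from the connector's schema inference over the sampled documents. The following is a sketch, assuming the connector's standard DataFrame reader path, of pinning the type with an explicit schema so it does not depend on the sampled data; note that an explicit schema would also put the all-null case onto the typed code path, so those nulls would then be omitted as well.)

      import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

      // Explicit schema: c is a nullable double regardless of what sampling finds
      val explicitSchema = StructType(Seq(
        StructField("a", DoubleType, nullable = true),
        StructField("b", DoubleType, nullable = true),
        StructField("c", DoubleType, nullable = true)))

      val typedDf = sqlContext.read
        .format("com.mongodb.spark.sql.DefaultSource")
        .schema(explicitSchema)
        .load()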

      The root cause of this anomaly is the following code in MapFunctions.scala:

  private[spark] def rowToDocumentMapper(schema: StructType, extendedBsonTypes: Boolean): (Row) => BsonDocument = {
    // foreach field type, decide what function to use to map its value
    val mappers = schema.fields.map({ field =>
      if (field.dataType == NullType) {
        // NullType fields always append an explicit BsonNull
        (data: Any, document: BsonDocument) => document.append(field.name, new BsonNull())
      } else {
        // Every other type appends the field only when the value is non-null
        val mapper = wrappedDataTypeToBsonValueMapper(field.dataType, field.nullable, extendedBsonTypes)
        (data: Any, document: BsonDocument) => if (data != null) document.append(field.name, mapper(data))
      }
    })
    // ...
      

      If the field's dataType is NullType, the mapper unconditionally appends an explicit BsonNull. For any other dataType, the field is appended only when the value is non-null, so nulls in a typed column are silently dropped from the output document.
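      For illustration only (this is not the shipped connector code), a minimal sketch of how the non-NullType branch could preserve nulls for nullable fields, where mapper stands in for the function returned by wrappedDataTypeToBsonValueMapper:

      import org.bson.{BsonDocument, BsonNull, BsonValue}

      // Sketch: a null-preserving variant of the non-NullType branch
      def appendWithNulls(name: String, nullable: Boolean, mapper: Any => BsonValue)
                         (data: Any, document: BsonDocument): Unit =
        if (data != null) document.append(name, mapper(data))
        else if (nullable) document.append(name, new BsonNull()) // keep explicit nulls
        // non-nullable fields with a null value are still skipped, as today

      In the meantime, a workaround sketch is to build the documents by hand so the nulls survive the write. This reuses df and writeConfig from the program above, restricts itself to the example's flat a/b/c fields, and drops _id so MongoDB assigns new ids:

      import org.bson.Document

      val source = df.drop("_id")
      val fieldNames = source.schema.fieldNames
      val docs = source.rdd.map { row =>
        val doc = new Document()
        fieldNames.zipWithIndex.foreach { case (name, i) =>
          // Write an explicit null for any null column value
          doc.append(name, if (row.isNullAt(i)) null else row.get(i).asInstanceOf[AnyRef])
        }
        doc
      }
      MongoSpark.save(docs, writeConfig)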

    People

    • Assignee: Ross Lawley (ross.lawley)
    • Reporter: James Kovacs (james.kovacs)