- Type: Bug
- Resolution: Won't Fix
- Priority: Critical - P2
- Fix Version/s: None
- Affects Version/s: 2.2.3, 2.3.3, 2.4.0, 2.4.2
- Component/s: None
- Labels: None
When saving a DataFrame to MongoDB with a nullable field, the field is omitted from documents where its value is null. Only if every value in the column is null are the nulls saved.
Given the following simple program:
import com.mongodb.spark._
import com.mongodb.spark.config._
import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
val df = MongoSpark.load(sqlContext)
println("DataFrame Schema")
df.printSchema()
df.show()

val writeConfig = WriteConfig(Map(
  "uri" -> sys.env("MONGO_URI"),
  "database" -> "test",
  "collection" -> "dataframe_output",
  "mode" -> "overwrite"))
MongoSpark.save(df, writeConfig)

println("Done")
System.exit(0)
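For reference, the same write can also be expressed through the DataFrameWriter API. This is a sketch assuming a connector version where the short "mongo" format alias is registered; otherwise the full source name com.mongodb.spark.sql.DefaultSource applies:

// equivalent write via the DataFrameWriter API (sketch)
df.write
  .format("mongo")
  .mode("overwrite")
  .option("uri", sys.env("MONGO_URI"))
  .option("database", "test")
  .option("collection", "dataframe_output")
  .save()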
When run against a source collection with the following documents:
test> db.sample_data.find()
{ "_id": ObjectId("5eded30f4e63423e35515711"), "a": 1, "b": 1, "c": null }
{ "_id": ObjectId("5edfc0914e63423e35515712"), "a": 2, "b": 2, "c": null }
DataFrame schema and data:
DataFrame Schema
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- a: double (nullable = true)
 |-- b: double (nullable = true)
 |-- c: null (nullable = true)

+--------------------+---+---+----+
|                 _id|  a|  b|   c|
+--------------------+---+---+----+
|[5eded30f4e63423e...|1.0|1.0|null|
|[5edfc0914e63423e...|2.0|2.0|null|
+--------------------+---+---+----+
And the output is as expected, with the c field present in both output documents with a value of null:
test> db.dataframe_output.find()
{ "_id": ObjectId("5eded30f4e63423e35515711"), "a": 1, "b": 1, "c": null }
{ "_id": ObjectId("5edfc0914e63423e35515712"), "a": 2, "b": 2, "c": null }
If we alter the source collection so that one of the documents contains a value for c:
test> db.sample_data.find()
{ "_id": ObjectId("5eded30f4e63423e35515711"), "a": 1, "b": 1, "c": 1 }
{ "_id": ObjectId("5edfc0914e63423e35515712"), "a": 2, "b": 2, "c": null }
Note that the c field is now implicitly typed as double rather than null:
DataFrame Schema
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- a: double (nullable = true)
 |-- b: double (nullable = true)
 |-- c: double (nullable = true)

+--------------------+---+---+----+
|                 _id|  a|  b|   c|
+--------------------+---+---+----+
|[5eded30f4e63423e...|1.0|1.0| 1.0|
|[5edfc0914e63423e...|2.0|2.0|null|
+--------------------+---+---+----+
And the resulting output omits the c field entirely from the document where its value is null:
test> db.dataframe_output.find()
{ "_id": ObjectId("5eded30f4e63423e35515711"), "a": 1, "b": 1, "c": 1 }
{ "_id": ObjectId("5edfc0914e63423e35515712"), "a": 2, "b": 2 }
The root cause of this anomaly is the following code in MapFunctions.scala:
private[spark] def rowToDocumentMapper(schema: StructType, extendedBsonTypes: Boolean): (Row) => BsonDocument = {
  // foreach field type, decide what function to use to map its value
  val mappers = schema.fields.map({ field =>
    if (field.dataType == NullType) {
      (data: Any, document: BsonDocument) => document.append(field.name, new BsonNull())
    } else {
      val mapper = wrappedDataTypeToBsonValueMapper(field.dataType, field.nullable, extendedBsonTypes)
      (data: Any, document: BsonDocument) => if (data != null) document.append(field.name, mapper(data))
    }
  })
If field.dataType is NullType, a null is written for every row. If field.dataType is any other type, a value is appended only when it is non-null, so nulls in a typed column are silently dropped from the output documents.
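For illustration only, here is a minimal sketch of how the mapper could preserve nulls for nullable typed fields as well. The else branch guarded by field.nullable is an assumption about a possible fix, not code from the connector:

private[spark] def rowToDocumentMapper(schema: StructType, extendedBsonTypes: Boolean): (Row) => BsonDocument = {
  // for each field type, decide what function to use to map its value
  val mappers = schema.fields.map({ field =>
    if (field.dataType == NullType) {
      // NullType columns: always write an explicit BsonNull
      (data: Any, document: BsonDocument) => document.append(field.name, new BsonNull())
    } else {
      val mapper = wrappedDataTypeToBsonValueMapper(field.dataType, field.nullable, extendedBsonTypes)
      (data: Any, document: BsonDocument) =>
        if (data != null) document.append(field.name, mapper(data))
        // hypothetical change: write an explicit BsonNull instead of dropping the field
        else if (field.nullable) document.append(field.name, new BsonNull())
    }
  })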