Type: Task
Resolution: Cannot Reproduce
Priority: Major - P3
Affects Version/s: 1.0.0
Component/s: None
Environment: pyspark 1.6.2 on databricks
Currently, the DataFrameWriter interface for using the connector from Python seems to require a "reduce"/"action" type operation (.collect(), .count()) on the DataFrame before writing; .cache() alone is insufficient (it is lazy). We are writing with:
{code:python}
dataframe.write.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.output.uri", mongo_uri) \
    .option("spark.mongodb.output.database", database) \
    .option("spark.mongodb.output.collection", collection) \
    .mode(mode) \
    .save()
{code}
This is connected to SPARK-74, in that we are doing a similar operation as a workaround for upsert: read object (A), edit a subset of A to create object (B), manually "upsert" B onto A to produce (AB), and then overwrite the collection with AB. We have tried a number of combinations of .cache() and .count() before writing AB, and the results don't seem entirely sensible, and are risky. A sketch of the workflow is below, followed by our trials and hypotheses about the causes of the behavior. Is this the intended behavior of the connector??
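For context, here is a minimal sketch of that workflow under stated assumptions: the placeholder URI/database/collection values, the assumed schema (a boolean "needs_update" column and a string "value" column), and the merge logic are illustrative only, not our actual code; sqlContext is the one provided by the Databricks notebook environment.

{code:python}
from pyspark.sql import functions as F

# Placeholders -- our real values are configured elsewhere.
mongo_uri = "mongodb://host:27017"
database = "mydb"
collection = "mycoll"

# (A) read the existing collection.
A = (sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
     .option("spark.mongodb.input.uri", mongo_uri)
     .option("spark.mongodb.input.database", database)
     .option("spark.mongodb.input.collection", collection)
     .load())

# (B) edit a subset of A; the filter and the new value are illustrative.
B = A.filter(F.col("needs_update")).withColumn("value", F.lit("updated"))

# (AB) manual "upsert": untouched rows of A plus the edited rows.
AB = A.filter(~F.col("needs_update")).unionAll(B)

# Overwrite the collection with AB -- this is the step the trials below vary.
(AB.write.format("com.mongodb.spark.sql.DefaultSource")
 .option("spark.mongodb.output.uri", mongo_uri)
 .option("spark.mongodb.output.database", database)
 .option("spark.mongodb.output.collection", collection)
 .mode("overwrite")
 .save())
{code}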
- no .count(), no .cache():
** writes an empty object (erases all data)
- AB.count(), no .cache():
** writes only object B
** hypothesis: object A is discarded to free memory during .count(); the write does in fact happen (is .write() actually lazy??)
- AB.count(), AB.cache():
** writes AB correctly
- AB.count(), A.cache():
** writes AB correctly, but takes 2x as long
** hypothesis: AB needs to be re-evaluated for both .count() and .write() (but it does write correctly)
- AB.cache(), no .count():
** writes an empty object (erases data)
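For completeness, a minimal sketch of the combination that wrote AB correctly in our trials (AB.count() together with AB.cache()); the ordering shown here (cache first so the count populates the cache) is our assumption about why it works, not documented connector behavior:

{code:python}
# AB is the merged DataFrame from the sketch above.
AB.cache()   # lazy on its own -- marking for caching did not help by itself
AB.count()   # action: materializes AB (and, we assume, populates the cache)

(AB.write.format("com.mongodb.spark.sql.DefaultSource")
 .option("spark.mongodb.output.uri", mongo_uri)
 .option("spark.mongodb.output.database", database)
 .option("spark.mongodb.output.collection", collection)
 .mode("overwrite")
 .save())
{code}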