Type: Task
Resolution: Cannot Reproduce
Priority: Major - P3
Affects Version/s: 1.0.0
Component/s: None
Environment: pyspark 1.6.2 on databricks
Currently, the DataFrameWriter interface for using the connector from Python seems to require a "reduce"/"action" type operation (.collect(), .count()) on the DataFrame before writing; .cache() alone is insufficient (it is lazy). We are writing with:
{code:python}
dataframe.write.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.output.uri", mongo_uri) \
    .option("spark.mongodb.output.database", database) \
    .option("spark.mongodb.output.collection", collection) \
    .mode(mode) \
    .save()
{code}
This is connected to SPARK-74, in that we are doing a similar operation as a workaround for upsert: read object (A), edit a subset of A to create object (B), manually "upsert" B onto A to produce (AB), and then overwrite the collection with AB. We have tried a number of combinations of .cache() and .count() before writing AB, and the results don't seem entirely sensible, and are risky. A sketch of the workflow is below, followed by our trials and hypotheses about the causes of the behavior. Is this the intended behavior of the connector??
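For context, here is a minimal sketch of that workflow under stated assumptions: the placeholder URI/database/collection values, the assumed schema (a boolean "needs_update" column and a string "value" column), and the merge logic are illustrative only, not our actual code; sqlContext is the one provided by the Databricks notebook environment.

{code:python}
from pyspark.sql import functions as F

# Placeholders -- our real values are configured elsewhere.
mongo_uri = "mongodb://host:27017"
database = "mydb"
collection = "mycoll"

# (A) read the existing collection.
A = (sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
     .option("spark.mongodb.input.uri", mongo_uri)
     .option("spark.mongodb.input.database", database)
     .option("spark.mongodb.input.collection", collection)
     .load())

# (B) edit a subset of A; the filter and the new value are illustrative.
B = A.filter(F.col("needs_update")).withColumn("value", F.lit("updated"))

# (AB) manual "upsert": untouched rows of A plus the edited rows.
AB = A.filter(~F.col("needs_update")).unionAll(B)

# Overwrite the collection with AB -- this is the step the trials below vary.
(AB.write.format("com.mongodb.spark.sql.DefaultSource")
 .option("spark.mongodb.output.uri", mongo_uri)
 .option("spark.mongodb.output.database", database)
 .option("spark.mongodb.output.collection", collection)
 .mode("overwrite")
 .save())
{code}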
- no .count(), no .cache():
** writes an empty object (erases all data)
- AB.count(), no .cache():
** writes only object B
** hypothesis: object A is discarded to free memory during .count(); the write does in fact happen (is .write() actually lazy??)
- AB.count(), AB.cache():
** writes AB correctly
- AB.count(), A.cache():
** writes AB correctly, but takes 2x as long
** hypothesis: AB needs to be re-evaluated for both .count() and .write() (but it does write correctly)
- AB.cache(), no .count():
** writes an empty object (erases data)
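For completeness, a minimal sketch of the combination that wrote AB correctly in our trials (AB.count() together with AB.cache()); the ordering shown here (cache first so the count populates the cache) is our assumption about why it works, not documented connector behavior:

{code:python}
# AB is the merged DataFrame from the sketch above.
AB.cache()   # lazy on its own -- marking for caching did not help by itself
AB.count()   # action: materializes AB (and, we assume, populates the cache)

(AB.write.format("com.mongodb.spark.sql.DefaultSource")
 .option("spark.mongodb.output.uri", mongo_uri)
 .option("spark.mongodb.output.database", database)
 .option("spark.mongodb.output.collection", collection)
 .mode("overwrite")
 .save())
{code}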