Spark Connector / SPARK-146

Python only supports DataFrames and not RDDs

    • Type: Improvement
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.2.0
    • Component/s: API
    • Labels: None
    • Environment:
      Spark with Mongo Connector

      Description:

      Title was: "Can't write json RDD to Mongo which Mongo is famous for schema-less design"

      The problem is that we can't use the Python Mongo Spark Connector to write an RDD to MongoDB.
      Note: we have tried the latest version.

      Business impact:
      This affects all Python users who are trying to write dynamic-schema data into MongoDB.
      MongoDB is famous for its schema-less design, yet only the Scala Mongo Spark Connector can write RDDs with dynamic schemas back to MongoDB; Python users are suffering.

      Dynamic Schema Challenge
      Spark has RDDs and DataFrames. By design, RDDs support dynamic schemas, while DataFrames only support an explicit schema, for better performance.
      The Mongo Spark Connector's Scala API supports RDD reads and writes, but the Python API does not: it only supports DataFrames, which by Spark's design do not support dynamic schemas.
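      To make the gap concrete, here is a minimal PySpark sketch (field names are made up, not from this ticket) of how an RDD holds heterogeneous documents as-is while a DataFrame forces one schema over all rows:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("dynamic-schema-demo").getOrCreate()

      # Two documents that do not share the same fields -- normal for MongoDB.
      docs = [
          {"sku": "A1", "specs": {"voltage": 5}},
          {"sku": "B2", "tags": ["Logic", "Logic ICs"]},
      ]

      # As an RDD of dicts, the heterogeneous shape is preserved exactly:
      rdd = spark.sparkContext.parallelize(docs)
      print(rdd.collect())

      # A DataFrame needs one schema for every row; inference over this list
      # unions the fields and fills the gaps with nulls:
      df = spark.createDataFrame(docs)
      df.show()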

      ----Workaround for the read phase, completed (a sketch follows the list)
      1. Read Mongo documents into a DataFrame.
      2. Dump the data to JSON strings.
      3. Transfer them to the TD Spark application.
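      A minimal sketch of that read path, assuming the 2.x connector's DefaultSource and a placeholder URI:

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("mongo-read-workaround")
               .config("spark.mongodb.input.uri",
                       "mongodb://host:27017/db.collection")  # placeholder URI
               .getOrCreate())

      # 1. Read Mongo documents into a DataFrame via the connector.
      df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

      # 2. Dump each row to a JSON string; this recovers a schema-free view.
      json_rdd = df.toJSON()

      # 3. Hand the JSON strings to the downstream application.
      for doc in json_rdd.take(5):
          print(doc)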

      ----Blocking issue in the write phase, pending on the Mongo Spark team
      For writes, we parse the strings into dynamic-schema dictionaries in an RDD, but we can't push the RDD to the connector without converting it to a DataFrame.
      I think we need to consult the Mongo Spark team; once the connector supports RDD writes, we can migrate all our code to Python.
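      A hedged sketch of where the write path gets stuck (placeholder URI and documents; option names as in the 2.x connector). The only route into the connector from Python is a DataFrame, so the dynamic documents have to be squeezed into one inferred schema first:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("mongo-write-blocker").getOrCreate()

      # The dynamic-schema documents we want to write, as JSON strings:
      json_strings = ['{"sku": "A1", "specs": {"voltage": 5}}',
                      '{"sku": "B2", "tags": ["Logic", "Logic ICs"]}']

      # There is no RDD write path in the Python API, so read.json must
      # infer one schema over all documents, nulling out fields that a
      # given document lacks:
      df = spark.read.json(spark.sparkContext.parallelize(json_strings))

      (df.write.format("com.mongodb.spark.sql.DefaultSource")
         .option("uri", "mongodb://host:27017/db.collection")  # placeholder
         .mode("append")
         .save())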

      Issue History:
      1. The RDD approach was deprecated in the mongo-hadoop project in March 2016.
      RDD.saveAsNewAPIHadoopFile, which used to write data into MongoDB, is deprecated:
      rdd.saveAsNewAPIHadoopFile(
          path='file:///this-is-unused',
          outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
          keyClass='org.apache.hadoop.io.Text',
          valueClass='org.apache.hadoop.io.MapWritable',
          conf={'mongo.output.uri': 'mongodb://t2cUserQA:G05hark5@qa-t2c-node1.paradata.io:27017/t2c.JasonpartFromSpark2'}
      )
      Announced at: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
      2. ObjectId issue: resolved. When converting to a DataFrame we hit: "TypeError: not supported type: <class 'bson.objectid.ObjectId'>"
      Tracked by: https://jira.mongodb.org/browse/HADOOP-277
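      For reference, the usual workaround (a hedged sketch, not taken from this ticket) is to stringify _id before the DataFrame conversion, since PySpark's type inference does not know bson.objectid.ObjectId:

      from bson.objectid import ObjectId
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("objectid-workaround").getOrCreate()

      docs = [{"_id": ObjectId(), "sku": "A1"}]  # shape of what pymongo hands back

      def stringify_id(doc):
          doc["_id"] = str(doc["_id"])  # ObjectId -> its 24-char hex string
          return doc

      df = spark.createDataFrame(spark.sparkContext.parallelize(docs).map(stringify_id))
      df.show()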
      Schema-related issues:
      3. StructType issue: "com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType"
      Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ
      4. Repartition issue:
      "Cannot cast ARRAY into a StructType(StructField(0,StringType,true), StructField(1,StringType,true), StructField(2,StringType,true), StructField(3,StringType,true), StructField(4,StringType,true)) (value: BsonArray{values=[BsonString{value='Logic'}, BsonString{value='Logic ICs'}]})"
      Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ
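      Both errors indicate that the inferred schema declared a StructType where the data actually holds arrays. A hedged sketch of the usual fix, declaring the array field explicitly ("categories" is a made-up stand-in for the field holding ['Logic', 'Logic ICs']):

      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, StringType, ArrayType

      spark = (SparkSession.builder
               .appName("explicit-array-schema")
               .config("spark.mongodb.input.uri",
                       "mongodb://host:27017/db.collection")  # placeholder URI
               .getOrCreate())

      schema = StructType([
          StructField("sku", StringType(), True),              # made-up field
          StructField("categories", ArrayType(StringType()), True),
      ])

      # Supplying the schema up front bypasses the sampling-based inference:
      df = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
            .schema(schema)
            .load())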

            Assignee: Ross Lawley (ross@mongodb.com)
            Reporter: Chao Zhang (nbajason)
            Votes: 0
            Watchers: 3
