Spark Connector / SPARK-215

Unable to read MongoDB data (JSON) in PySpark

    • Type: Task
    • Resolution: Done
    • Priority: Critical - P2
    • Affects Version/s: 2.3.0
    • Component/s: Writes

      I am connecting to the MongoDB database via PyMongo and achieved the expected result of fetching the data out of the DB in JSON format. My task, however, is to create a Hive table via PySpark. I found that the JSON MongoDB produces (RF719) is not supported by Spark: when I try to load the data into a PySpark DataFrame, it shows up as a corrupted record. Please suggest a solution.

      I am reading the data in PySpark using the code below:

      from pyspark import SparkContext, SparkConf, StorageLevel
      from pyspark.sql import HiveContext, Row   # HiveContext is in pyspark.sql, not pyspark
      from pyspark.sql.functions import *

      sc = SparkContext()
      hiveContext = HiveContext(sc)

      # Read the whole file as a single value, then parse it as multiline JSON
      df = hiveContext.read.option("multiline", "true").json(sc.wholeTextFiles('file:/data06/XXXXXXXXX.json').values())
      

      This is how it reads the data:

      +--------------------+
      |     _corrupt_record|
      +--------------------+
      |"[{\"finalization...|
      +--------------------+
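      A possible explanation, judging from the `_corrupt_record` value starting with `"[{\"`: the file does not contain a raw JSON array but a JSON *string* whose value is an escaped JSON array (double-encoded JSON), which Spark's JSON reader cannot parse directly. A minimal sketch of unwrapping such data before handing it to Spark, using a hypothetical sample value and field names (the actual document structure is not shown in the issue):

      ```python
      import json

      # Hypothetical double-encoded payload: a JSON string whose value is
      # itself an escaped JSON array, mimicking the corrupt record above.
      double_encoded = '"[{\\"finalization\\": \\"done\\", \\"id\\": 1}]"'

      # First json.loads unwraps the outer string...
      inner_text = json.loads(double_encoded)
      # ...the second parses the actual array of documents.
      records = json.loads(inner_text)

      print(records)  # [{'finalization': 'done', 'id': 1}]
      ```

      If this is the cause, decoding each line once (e.g. with `sc.textFile(...).map(json.loads)`) and then parsing the result as JSON should yield a clean DataFrame; alternatively, reading directly from MongoDB with the Spark Connector avoids the intermediate JSON dump altogether.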

            Assignee:
            Ross Lawley (ross@mongodb.com)
            Reporter:
            rajaraman (rjraman100)
            Votes:
            0
            Watchers:
            2

              Created:
              Updated:
              Resolved: