unable to read the mongodb data (json) in pyspark


    • Type: Task
    • Resolution: Done
    • Priority: Critical - P2
    • Affects Version/s: 2.3.0
    • Component/s: Writes

      I am connecting to the MongoDB database via pymongo and can fetch the data out of the database in JSON format as expected. However, my task is to create a Hive table via PySpark, and I found that the JSON MongoDB produces (RF719) is not supported by Spark: when I try to load the data into a PySpark DataFrame, it shows up as a corrupt record. Please suggest a solution.

      I am reading the data in PySpark using the code below:

      from pyspark import SparkContext, SparkConf, StorageLevel
      from pyspark.sql import HiveContext, Row  # HiveContext lives in pyspark.sql, not pyspark
      from pyspark.sql.functions import *

      sc = SparkContext()
      hiveContext = HiveContext(sc)

      # Read the whole file as a single string so the multi-line JSON parses
      df = hiveContext.read.option("multiline", "true").json(
          sc.wholeTextFiles('file:/data06/XXXXXXXXX.json').values())
      

      This is how it reads the data:

          +--------------------+
          |     _corrupt_record|
          +--------------------+
          |"[{\"finalization...|
          +--------------------+
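      A `_corrupt_record` column like this usually means Spark's default JSON reader expected one JSON object per line (JSON Lines), while the exported file is a single multi-line JSON array. One workaround, independent of Spark, is to rewrite the array as JSON Lines so the default reader can parse it. A minimal sketch in plain Python (the file names and sample records are hypothetical, not from the original report):

          import json

          # Hypothetical input: a dump that is one big JSON array,
          # similar to what an array-style MongoDB export produces.
          array_dump = '[{"_id": 1, "name": "a"},\n {"_id": 2, "name": "b"}]'
          with open("dump.json", "w") as f:
              f.write(array_dump)

          # Parse the whole array, then emit one object per line.
          # Spark's default json reader (multiline=false) can load
          # the resulting JSON Lines file without _corrupt_record.
          with open("dump.json") as f:
              records = json.load(f)

          with open("dump.jsonl", "w") as f:
              for rec in records:
                  f.write(json.dumps(rec) + "\n")

      After this conversion, `hiveContext.read.json("dump.jsonl")` should infer the schema normally, and the resulting DataFrame can be saved as a Hive table.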

              Assignee:
              Ross Lawley
              Reporter:
              rajaraman
              Votes:
              0
              Watchers:
              2