Type: Task
Resolution: Done
Priority: Critical - P2
Affects Version/s: 2.3.0
Component/s: Writes
I am connecting to the MongoDB database via pymongo and achieved the expected result of fetching the data out of the db in JSON format. My task, however, is to create a Hive table via PySpark, and I found that the MongoDB-provided JSON (RF719) is not supported by Spark: when I try to load the data into a PySpark DataFrame, it shows up as a corrupted record. Please suggest a solution.
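For reference, a minimal sketch of one way the pymongo export could be written so that the file contains plain JSON rather than a string-encoded copy of it. The connection string, database, and collection names here are placeholders, not taken from the ticket:

from pymongo import MongoClient
from bson import json_util

# Hypothetical connection string, database, and collection names.
client = MongoClient('mongodb://localhost:27017')
docs = list(client['mydb']['mycoll'].find())

# json_util.dumps serializes BSON types (ObjectId, datetime, ...) that the
# standard json module cannot handle. Write its output directly: wrapping it
# in a second json.dumps call would string-encode the whole array and produce
# exactly the "corrupt record" symptom when Spark reads the file back.
with open('/data06/XXXXXXXXX.json', 'w') as f:
    f.write(json_util.dumps(docs))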
I am reading the data in PySpark with the code below:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hiveContext = HiveContext(sc)

# Read each file as one whole string so multi-line JSON stays intact,
# then hand the contents to the JSON reader.
df = hiveContext.read.option("multiline", "true").json(
    sc.wholeTextFiles('file:/data06/XXXXXXXXX.json').values()
)
Please find the way it reads the data:

+--------------------+
|     _corrupt_record|
+--------------------+
|"[{\"finalization...|
+--------------------+
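The escaped quotes in that sample ("[{\"finalization...) suggest the file holds a JSON array that was serialized a second time as a JSON string, so Spark parses a single plain string and flags it as corrupt. A minimal sketch of one possible workaround, assuming that double encoding is the cause: decode the outer string layer first, then let Spark parse the inner array (each array element becomes one row).

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read each file as one whole string.
raw = spark.sparkContext.wholeTextFiles('file:/data06/XXXXXXXXX.json').values()

# If the content itself starts with a double quote, it is a JSON-encoded
# string; decode it once to recover the inner JSON array text.
inner = raw.map(lambda s: json.loads(s) if s.lstrip().startswith('"') else s)

# Spark's JSON reader expands a top-level array into one row per element.
df = spark.read.json(inner)
df.printSchema()

Note that if the documents also contain MongoDB Extended JSON types (e.g. {"$oid": ...} or {"$date": ...}), those fields will arrive as nested structs rather than native types and may need to be flattened or cast before writing the Hive table.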