Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: 2.2.8
Component/s: Configuration, Reads
Environment: Storage: 1 PB
Memory: 1.4 TB
Nodes: 16-node cluster
vCore CPU: 96
Dear,
When we try to load data via the Spark connector, it fails to load the collection's documents completely.
Example: the Employee collection has 500 documents.
When we load this collection into a DataFrame using the Spark connector, we get a different count on each load.
Sometimes it loads completely, and sometimes it misses a few records.
Kindly suggest what I am doing wrong.
Below is the sample command :
spark-shell --master yarn --num-executors 10 --executor-memory 10g --executor-cores 8 --driver-memory 20g --jars /rtmstaging/BINS/CODEBASE/RAW_ZONE/SPARK/JAR/utils/mongo-spark-connector_2.11-2.3.3.jar,/rtmstaging/BINS/CODEBASE/RAW_ZONE/SPARK/JAR/utils/mongo-java-driver-3.12.2.jar
import spark.implicits._
val numLagDays = 2; val numCurrentDays = 1
val yes_dt = spark.sql("select date_sub(current_date()," + numLagDays + ")").as[String].first + "T20:00:00Z"
val Tod_dt = spark.sql("select date_sub(current_date()," + numCurrentDays + ")").as[String].first + "T20:00:00Z"
val pipeline_cdt = "[{ $match: {$or:[{$and:[{'CreatedDate' : {$gte : ISODate('" + yes_dt + "')}},{'CreatedDate' : {$lt : ISODate('" + Tod_dt + "')}}]},{$and:[{'UpdatedDate' : {$gte : ISODate('" + yes_dt + "')}},{'UpdatedDate' : {$lt : ISODate('" + Tod_dt + "')}}]}]} } ]"
val dfr = spark.read.format("mongo").option("pipeline", pipeline_cdt)
val df = dfr.option("uri", "mongodb://BIG_DATA_USER:Sad#12345@10.10.10.10:27017/Reports.employee?authSource=admin&readPreference=secondary&appname=MongoDB%20Compass&ssl=false&replicaSet=Reports").load()
df.show(false)
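For reference, the date boundaries and the pipeline string above can be built without hand-spliced quotes. This is a minimal sketch with hypothetical helper names (`isoBoundary`, `changeWindowPipeline` are not part of the connector); it uses `java.time` instead of a `spark.sql` round-trip and assembles the same `$match` stage as in the session:

```scala
import java.time.LocalDate

// Hypothetical helper: "T20:00:00Z" boundary for the date `daysBack` days ago,
// computed with java.time instead of a spark.sql round-trip.
def isoBoundary(daysBack: Long): String =
  LocalDate.now.minusDays(daysBack).toString + "T20:00:00Z"

// Hypothetical helper: builds the same $match pipeline as above from the two
// boundaries, so every quote and brace is balanced in one place.
def changeWindowPipeline(fromDt: String, toDt: String): String = {
  def range(field: String) =
    s"{$$and:[{'$field':{$$gte:ISODate('$fromDt')}},{'$field':{$$lt:ISODate('$toDt')}}]}"
  s"[{$$match:{$$or:[${range("CreatedDate")},${range("UpdatedDate")}]}}]"
}

val pipeline_cdt = changeWindowPipeline(isoBoundary(numLagDays), isoBoundary(numCurrentDays))
```

The resulting string can then be passed to `.option("pipeline", pipeline_cdt)` exactly as in the session above.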
df.count() varies between loads into the DataFrame, but when we check the counts in MongoDB Compass/Robo 3T they are always the same.
It's an intermittent issue: sometimes we miss records and sometimes we get all of them.
We don't know what causes this.
Regards,
Sadique