Spark Connector / SPARK-285

Spark connector failing to completely load a collection from MongoDB; missing records.

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 2.2.8
    • Component/s: Configuration, Reads
    • Labels: None
    • Environment:
      Storage: 1 PB
      Memory: 1.4 TB
      Nodes: 16-node cluster
      vCPU cores: 96

      Dear,

      When we try to load data via the Spark connector, it fails to load the collection's documents completely.

      Example: the Employee collection has 500 documents.

      When we load this collection into a DataFrame using the Spark connector, we get a different record count on different runs.

      Sometimes it loads completely and sometimes it misses a few records.

      Kindly suggest what I am doing wrong.

       

      Below are the sample commands:

       

      spark-shell --master yarn --num-executors 10 --executor-memory 10g --executor-cores 8 --driver-memory 20g --jars /rtmstaging/BINS/CODEBASE/RAW_ZONE/SPARK/JAR/utils/mongo-spark-connector_2.11-2.3.3.jar,/rtmstaging/BINS/CODEBASE/RAW_ZONE/SPARK/JAR/utils/mongo-java-driver-3.12.2.jar

      import spark.implicits._

      val numLagDays = 2
      val numCurrentDays = 1

      // Window boundaries as ISO-8601 strings: numLagDays and numCurrentDays days before today, at 20:00Z
      val yes_dt = spark.sql("select date_sub(current_date(), " + numLagDays + ")").as[String].first + "T20:00:00Z"
      val Tod_dt = spark.sql("select date_sub(current_date(), " + numCurrentDays + ")").as[String].first + "T20:00:00Z"

      // Aggregation pipeline: documents created or updated within [yes_dt, Tod_dt)
      val pipeline_cdt = "[ { $match: { $or: [" +
        " { $and: [ { 'CreatedDate': { $gte: ISODate('" + yes_dt + "') } }, { 'CreatedDate': { $lt: ISODate('" + Tod_dt + "') } } ] }," +
        " { $and: [ { 'UpdatedDate': { $gte: ISODate('" + yes_dt + "') } }, { 'UpdatedDate': { $lt: ISODate('" + Tod_dt + "') } } ] }" +
        " ] } } ]"

      val dfr = spark.read.format("mongo").option("pipeline", pipeline_cdt)

      val df = dfr.option("uri", "mongodb://BIG_DATA_USER:Sad#12345@10.10.10.10:27017/Reports.employee?authSource=admin&readPreference=secondary&appname=MongoDB%20Compass&ssl=false&replicaSet=Reports").load()

      df.show(false)
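
      For reference, a variant we can also try while debugging (a sketch only, reusing the URI and pipeline above; the option keys are the connector's documented input-configuration keys): pin the read preference to the primary so that a lagging secondary is not a variable, and set the partitioner and its options explicitly instead of relying on the default.

      // Sketch only: same load, but forced to the primary and with an explicit
      // partitioner, so every run reads the same data with the same partition bounds.
      val debugDf = spark.read
        .format("mongo")
        .option("uri", "mongodb://BIG_DATA_USER:Sad#12345@10.10.10.10:27017/Reports.employee?authSource=admin&ssl=false&replicaSet=Reports")
        .option("readPreference.name", "primary")
        .option("partitioner", "MongoSamplePartitioner")
        .option("partitionerOptions.partitionKey", "_id")
        .option("partitionerOptions.partitionSizeMB", "64")
        .option("pipeline", pipeline_cdt)
        .load()

      debugDf.count()

      If the count becomes stable with this configuration, that would point at the read preference or the default partitioning rather than the pipeline itself.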

      df.count() varies across loads into the DataFrame, but the collection counts checked in MongoDB Compass/Robo3T stay the same.
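
      One more check we could add (a sketch, not part of the original run): count rows per Spark partition on each load, so that when the total drops we can see whether the shortfall is confined to particular partitions.

      // Sketch: per-partition row counts for the loaded DataFrame, to compare
      // between runs and see which partitions (if any) come back short.
      val perPartition = df.rdd
        .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
        .collect()
        .sortBy(_._1)

      perPartition.foreach { case (idx, n) => println(s"partition $idx -> $n rows") }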

       

      It is an intermittent issue: sometimes we miss records and sometimes we get all of them.

      We don't know what causes this issue.

       

      Regards,

      Sadique

            Assignee: Ross Lawley (ross@mongodb.com)
            Reporter: Sadique Manzar (sadique.manzar@gmail.com)
            Votes: 0
            Watchers: 2
