Spark Connector / SPARK-358

PySpark job keeps running using MongoDB Spark connector v10.0.x

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 10.0.4
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None

      From the Community website:

      https://www.mongodb.com/community/forums/t/pyspark-job-keep-running-using-mongodb-spark-connector-v10-0-x/177302

      Spark 3.3.0, MongoDB Atlas 5.0.9, Spark Connector 10.x

      I run a small PySpark job that reads from MongoDB Atlas and writes to BigQuery.

      So far, with the MongoDB Spark connector v3.0.x, I had not encountered any errors and the job ended normally after loading the MongoDB documents and saving them into BigQuery.

      Only a few days ago, after upgrading to the connector's newest version (10.0.x), I started experiencing some strange behavior: my job keeps running even after finishing all tasks successfully.
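      For context, here is a minimal sketch of the kind of job I mean. The connection string, dataset, table and bucket names are placeholders, and the BigQuery options shown are just one possible setup:

      from pyspark.sql import SparkSession

      # Placeholder connection string; substitute real credentials/cluster host.
      MONGO_URI = "mongodb+srv://user:password@cluster0.example.mongodb.net"

      spark = (
          SparkSession.builder
          .appName("mongo-to-bigquery")
          # Connection URI config key used by the v10.x connector.
          .config("spark.mongodb.read.connection.uri", MONGO_URI)
          .getOrCreate()
      )

      # Read from MongoDB Atlas with the v10.x "mongodb" source
      # (this is the .load() call discussed below).
      df = (
          spark.read.format("mongodb")
          .options(database="database", collection="collection")
          .load()
      )

      # Write to BigQuery via the spark-bigquery connector; dataset, table and
      # bucket names are placeholders.
      (
          df.write.format("bigquery")
          .option("table", "my_dataset.my_table")
          .option("temporaryGcsBucket", "my-temp-bucket")
          .mode("overwrite")
          .save()
      )

      # Even with an explicit stop, the driver process does not exit with 10.0.0-10.0.3.
      spark.stop()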

      Here is the problematic line (by that, I mean that if I comment out just this one, my whole job ends correctly):

       

      df = spark.read.format("mongodb").options(database="database", collection="collection").load()

      In fact, it's precisely the .load() part of this line that seems to be the issue; the rest of the line causes no problem on its own.
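      In other words, isolating the call behaves like this (same placeholder names as above):

      # Building the reader alone is fine; the job still exits normally.
      reader = spark.read.format("mongodb").options(database="database", collection="collection")

      # It is the .load() call that leaves the driver process alive afterwards.
      df = reader.load()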

      Every time now, my last logs look like this:

       

      22/07/27 23:00:09 INFO SparkUI: Stopped Spark web UI at http://192.168.1.173:4040
      22/07/27 23:00:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
      22/07/27 23:00:09 INFO MemoryStore: MemoryStore cleared
      22/07/27 23:00:09 INFO BlockManager: BlockManager stopped
      22/07/27 23:00:09 INFO BlockManagerMaster: BlockManagerMaster stopped
      22/07/27 23:00:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
      22/07/27 23:00:09 INFO SparkContext: Successfully stopped SparkContext

      But then I have to force quit (with Ctrl-C, for instance, when running locally) to actually finish the job. This is very problematic when using cloud services like Google Dataproc Serverless, as the job keeps running and so the instance is never stopped.
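      To see what actually keeps the driver alive, one thing that can be done is dumping the live JVM threads through the py4j gateway. This is only a diagnostic sketch (it assumes client mode, where the gateway JVM is reachable from Python); any remaining non-daemon threads, presumably left behind by the connector's MongoDB client, are what prevent the JVM from exiting:

      # Diagnostic sketch: list live JVM threads from the PySpark driver.
      # Run this after the job has finished, right around spark.stop().
      jvm = spark.sparkContext._jvm
      threads = jvm.java.lang.Thread.getAllStackTraces().keySet().toArray()
      for t in threads:
          flag = "daemon" if t.isDaemon() else "NON-DAEMON (keeps the JVM alive)"
          print(t.getName(), "-", flag)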

      I tried every 10.0.x version (x = 0, 1, 2 and 3), but always encountered the same behavior.

      Is this expected behavior in version 10 that I'm missing, or not?

      I’ve tested it mainly with two collections: one very small with only 2 documents and another slightly larger with 20,000.

      Here are the different jars I’ve tested to reproduce it (see the spark.jars.packages sketch after the list):

      • (reader) mongodb-spark-connector (versions 10.0.0, 10.0.1, 10.0.2 & 10.0.3)
      • mongodb-driver-core / mongodb-driver-sync / bson (version 4.7.0 & 4.7.1)
      • (writer) spark-bigquery-with-dependencies_2.12-0.23.2
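
      For completeness, the same dependencies could also be resolved at session start-up via spark.jars.packages instead of local jar files. The Maven coordinates below are an assumption; verify them on Maven Central before relying on them:

      from pyspark.sql import SparkSession

      # Sketch: pull the connector jars through Maven instead of shipping local
      # jar files. Coordinates/versions are assumptions; check Maven Central.
      spark = (
          SparkSession.builder
          .appName("mongo-to-bigquery")
          .config(
              "spark.jars.packages",
              "org.mongodb.spark:mongo-spark-connector:10.0.3,"
              "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2",
          )
          .getOrCreate()
      )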

      Again, the process stops successfully using version 3.0.x of the mongodb-spark-connector jar together with the other mongodb-driver jars.

      At first I suspected it was new behavior due to the Structured Streaming support, but that doesn't seem to be the case.

       

       

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            robert.walters@mongodb.com Robert Walters
            Votes:
            1
            Watchers:
            3

              Created:
              Updated:
              Resolved: