Spark Connector / SPARK-358

PySpark job keeps running using MongoDB Spark connector v10.0.x

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 10.0.4
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None

      From the Community website:

      https://www.mongodb.com/community/forums/t/pyspark-job-keep-running-using-mongodb-spark-connector-v10-0-x/177302

      Spark 3.3.0, MongoDB Atlas 5.0.9, Spark Connector 10.x

      I run a small PySpark job that reads from MongoDB Atlas and writes to BigQuery.

      So far, with the MongoDB Spark connector v3.0.x, I had not encountered any errors and the job ended normally after loading the MongoDB documents and saving them into BigQuery.

      Only a few days ago, after upgrading to the connector's newest version (10.0.x), I started experiencing some strange behavior: my job keeps running even after finishing all tasks successfully.
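      For context, here is a minimal sketch of the kind of job I mean. The connection string, dataset, table and bucket names are placeholders, and the BigQuery options shown are just one possible setup:

      from pyspark.sql import SparkSession

      # Placeholder connection string; substitute real credentials/cluster host.
      MONGO_URI = "mongodb+srv://user:password@cluster0.example.mongodb.net"

      spark = (
          SparkSession.builder
          .appName("mongo-to-bigquery")
          # Connection URI config key used by the v10.x connector.
          .config("spark.mongodb.read.connection.uri", MONGO_URI)
          .getOrCreate()
      )

      # Read from MongoDB Atlas with the v10.x "mongodb" source
      # (this is the .load() call discussed below).
      df = (
          spark.read.format("mongodb")
          .options(database="database", collection="collection")
          .load()
      )

      # Write to BigQuery via the spark-bigquery connector; dataset, table and
      # bucket names are placeholders.
      (
          df.write.format("bigquery")
          .option("table", "my_dataset.my_table")
          .option("temporaryGcsBucket", "my-temp-bucket")
          .mode("overwrite")
          .save()
      )

      # Even with an explicit stop, the driver process does not exit with 10.0.0-10.0.3.
      spark.stop()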

      Here is the problematic line (by that, I mean that if I comment out just this one, my whole job ends correctly):

       

      df = spark.read.format("mongodb").options(database="database", collection="collection").load()

      In fact, it's precisely the .load() part of this line that seems to be the issue; the rest of the line causes no problem on its own.
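      In other words, isolating the call behaves like this (same placeholder names as above):

      # Building the reader alone is fine; the job still exits normally.
      reader = spark.read.format("mongodb").options(database="database", collection="collection")

      # It is the .load() call that leaves the driver process alive afterwards.
      df = reader.load()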

      Every time now, my last logs look like this:

       

      22/07/27 23:00:09 INFO SparkUI: Stopped Spark web UI at http://192.168.1.173:4040
      22/07/27 23:00:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
      22/07/27 23:00:09 INFO MemoryStore: MemoryStore cleared
      22/07/27 23:00:09 INFO BlockManager: BlockManager stopped
      22/07/27 23:00:09 INFO BlockManagerMaster: BlockManagerMaster stopped
      22/07/27 23:00:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
      22/07/27 23:00:09 INFO SparkContext: Successfully stopped SparkContext

      But then I have to force quit (with Ctrl-C, for instance, when running locally) to actually finish the job. This is very problematic when using cloud services like Google Dataproc Serverless, as the job keeps running and so the instance is never stopped.
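      To see what actually keeps the driver alive, one thing that can be done is dumping the live JVM threads through the py4j gateway. This is only a diagnostic sketch (it assumes client mode, where the gateway JVM is reachable from Python); any remaining non-daemon threads, presumably left behind by the connector's MongoDB client, are what prevent the JVM from exiting:

      # Diagnostic sketch: list live JVM threads from the PySpark driver.
      # Run this after the job has finished, right around spark.stop().
      jvm = spark.sparkContext._jvm
      threads = jvm.java.lang.Thread.getAllStackTraces().keySet().toArray()
      for t in threads:
          flag = "daemon" if t.isDaemon() else "NON-DAEMON (keeps the JVM alive)"
          print(t.getName(), "-", flag)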

      I tried every 10.0.x version (x = 0, 1, 2 and 3), but always encountered the same behavior.

      Is this expected behavior in version 10 that I'm missing, or not?

      I’ve tested it mainly with two collections: one very small with only 2 documents and another slightly larger with 20,000.

      Here are the different jars I’ve tested to reproduce it (see the spark.jars.packages sketch after the list):

      • (reader) mongodb-spark-connector (versions 10.0.0, 10.0.1, 10.0.2 & 10.0.3)
      • mongodb-driver-core / mongodb-driver-sync / bson (version 4.7.0 & 4.7.1)
      • (writer) spark-bigquery-with-dependencies_2.12-0.23.2
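
      For completeness, the same dependencies could also be resolved at session start-up via spark.jars.packages instead of local jar files. The Maven coordinates below are an assumption; verify them on Maven Central before relying on them:

      from pyspark.sql import SparkSession

      # Sketch: pull the connector jars through Maven instead of shipping local
      # jar files. Coordinates/versions are assumptions; check Maven Central.
      spark = (
          SparkSession.builder
          .appName("mongo-to-bigquery")
          .config(
              "spark.jars.packages",
              "org.mongodb.spark:mongo-spark-connector:10.0.3,"
              "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2",
          )
          .getOrCreate()
      )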

      Again, the process stops successfully using version 3.0.x of the mongodb-spark-connector jar together with the other mongodb-driver jars.

      At first I suspected it was new behavior due to the Structured Streaming support, but that doesn't seem to be the case.

       

       

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            robert.walters@mongodb.com Robert Walters
            Votes:
            1
            Watchers:
            3

              Created:
              Updated:
              Resolved: