- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
From the Community website:
Spark 3.3.0, MongoDB Atlas 5.0.9, Spark connector 10.x
I run a small PySpark job that reads from MongoDB Atlas and writes to BigQuery.
With the MongoDB Spark connector v3.0.x, I never encountered any errors: the job ended normally after loading the MongoDB documents and saving them into BigQuery.
A few days ago, after upgrading to the connector's newest version (10.0.x), I started seeing some strange behavior: my job keeps running even after all its tasks finish successfully.
Here is the problematic line (by that, I mean that if I comment out just this one line, the whole job ends correctly):
df = spark.read.format("mongodb").options(database="database", collection="collection").load()
Actually, it's precisely the .load() part of this line that seems to be the issue; the rest of the line causes no problem on its own.
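For completeness, here is a minimal sketch of the kind of script that reproduces it (the connection URI, database and collection names below are placeholders, not the real ones):

from pyspark.sql import SparkSession

# Placeholder connection URI for the Atlas cluster.
spark = (
    SparkSession.builder
    .appName("mongo-to-bigquery-repro")
    .config("spark.mongodb.read.connection.uri",
            "mongodb+srv://user:password@cluster.example.mongodb.net")
    .getOrCreate()
)

# The problematic line: with connector 10.0.x the driver process never exits
# after this .load(), even though all tasks complete successfully.
df = spark.read.format("mongodb").options(database="database", collection="collection").load()
print(df.count())

# SparkContext shuts down cleanly (see the logs below), but the process then hangs.
spark.stop()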
Every time now, my last logs look like this:
22/07/27 23:00:09 INFO SparkUI: Stopped Spark web UI at http://192.168.1.173:4040
22/07/27 23:00:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/27 23:00:09 INFO MemoryStore: MemoryStore cleared
22/07/27 23:00:09 INFO BlockManager: BlockManager stopped
22/07/27 23:00:09 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/27 23:00:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/27 23:00:09 INFO SparkContext: Successfully stopped SparkContext
But then I have to force quit (with Ctrl-C, for instance, when running locally) to actually finish the job. This is very problematic when using cloud services such as Google Dataproc Serverless, as the job keeps running and the instance is never stopped.
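To see what keeps the process alive once the SparkContext is stopped, something along these lines can be run at the end of the script; it goes through Spark's internal py4j gateway (spark.sparkContext._jvm), so treat it purely as a diagnostic sketch:

spark.stop()

# The SparkContext is stopped but the JVM is still up, so ask it which threads
# remain; any non-daemon thread is enough to keep the process from exiting.
jvm = spark.sparkContext._jvm  # internal py4j handle, diagnostic use only
threads = jvm.java.lang.Thread.getAllStackTraces().keySet().iterator()
while threads.hasNext():
    t = threads.next()
    print(t.getName(), "(daemon)" if t.isDaemon() else "(NON-daemon)")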
I tried every 10.0.x version (x = 0, 1, 2 and 3), and I always encounter the same behavior.
Is this something expected in version 10 that I'm missing, or not?
I've tested it mainly with two collections: one very small with only 2 documents and another slightly larger with 20,000.
Here are the different jars I've tested to reproduce it (a PySpark configuration sketch follows the list):
- (reader) mongodb-spark-connector (versions 10.0.0, 10.0.1, 10.0.2 & 10.0.3)
- mongodb-driver-core / mongodb-driver-sync / bson (version 4.7.0 & 4.7.1)
- (writer) spark-bigquery-with-dependencies_2.12-0.23.2
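For reference, a sketch of how these can be attached from PySpark via spark.jars.packages instead of local jars; the Maven coordinates below are assumptions based on the versions listed above, so double-check them on Maven Central:

from pyspark.sql import SparkSession

# Assumed Maven coordinates for the jars listed above (verify before use);
# the mongodb-driver-core / -sync / bson jars should come in transitively.
packages = ",".join([
    "org.mongodb.spark:mongo-spark-connector:10.0.3",
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2",
])

spark = (
    SparkSession.builder
    .appName("mongo-to-bigquery")
    .config("spark.jars.packages", packages)
    .getOrCreate()
)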
Again, the process stops successfully when using version 3.0.x of the mongodb-spark-connector jar with the other mongodb-driver jars.
At first I suspected this was new behavior due to the Structured Streaming support, but that doesn't seem to be the case.