- Type: Task
- Resolution: Gone away
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Spark Connector
- Labels: None
- Environment: Spark 2.2.0
Because using null is not idiomatic in Scala, Datasets automatically filter the documents in the database based on the fields of the Dataset's case class: documents that are missing a required (non-Option) field are excluded.
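A minimal, self-contained sketch of that filtering behavior (plain Scala, no Spark; `Strict`, `Lenient`, and the decoding helpers are illustrative stand-ins for the connector's schema handling, not its actual API):

```scala
// Two hypothetical case classes: one requires every field, one tolerates
// a missing field via Option[_].
case class Strict(name: String, age: Int)
case class Lenient(name: String, age: Option[Int])

// Stand-ins for MongoDB documents; the second one has no "age" field.
val docs: Seq[Map[String, Any]] = Seq(
  Map("name" -> "a", "age" -> 30),
  Map("name" -> "b")
)

// Simplified stand-in for the schema check: decoding fails (returns None)
// if any required field is absent.
def toStrict(d: Map[String, Any]): Option[Strict] =
  for {
    n <- d.get("name").collect { case s: String => s }
    a <- d.get("age").collect { case i: Int => i }
  } yield Strict(n, a)

def toLenient(d: Map[String, Any]): Option[Lenient] =
  d.get("name").collect { case s: String => s }
    .map(n => Lenient(n, d.get("age").collect { case i: Int => i }))

println(docs.flatMap(toStrict).size)  // 1: the document without "age" is filtered out
println(docs.flatMap(toLenient).size) // 2: the Option field keeps both documents
```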
---------------
was:
MongoRDD gets the right count, but Dataset does not
When I read a collection from MongoDB into a Dataset, some elements seem to be missing.
At first, I found this issue : https://jira.mongodb.org/browse/SPARK-154?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.1
So I switched from 2.2.0 to 2.2.1, but that didn't fix the problem.
val profileDs = MongoSpark.load(sparkSession.sparkContext)
val rddCount = profileDs.count()
val dsCount = profileDs.toDS[LightProfile]().count()
rddCount returns the correct value (1,647,864), but dsCount does not (901,028).
LightProfile is a Scala case class. When I remove some fields from the class, dsCount goes up, closer to the real value, until, once I have removed enough fields, it returns the right count.
I tried all the available partitioners, including MongoPaginateBySizePartitioner with varying partition sizes, but none of them changed the result.
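Given the resolution above, a hedged sketch of the fix: declare every field that may be absent from a document as Option[_]. The ticket does not show LightProfile's real schema, so the field names below are hypothetical.

```scala
// Hypothetical schema; the real LightProfile fields are not shown in the ticket.
// Fields that can be absent from a MongoDB document are declared Option[_],
// so those documents are no longer silently filtered out of the Dataset.
case class LightProfile(
  _id: String,
  email: Option[String] = None, // may be missing in some documents
  age: Option[Int] = None       // may be missing in some documents
)

// A document carrying only _id still yields a row instead of being dropped:
val partial = LightProfile("5a1b")
println(partial) // LightProfile(5a1b,None,None)
// With such a class, profileDs.toDS[LightProfile]().count() should match rddCount.
```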
- related to: SPARK-182 MongoSpark connector silently ignores documents it cannot unmarshal (Closed)