[DOCS-12255] [Spark] Document automatic collection filtering for Datasets Created: 06/Apr/18  Updated: 27/Oct/23  Resolved: 20/May/21

Status: Closed
Project: Documentation
Component/s: Spark Connector
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Maxime Assignee: Unassigned
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Spark 2.2.0


Issue Links:
Related
related to SPARK-182 MongoSpark connector silently ignores... Closed
Participants:
Days since reply: 5 years, 44 weeks, 1 day
Epic Link: DOCSP-6205

 Description   

Because using null is not idiomatic in Scala, Datasets automatically filter the documents read from the database based on the fields in the Dataset's case class: documents missing a required (non-Option) field are silently excluded.
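
A minimal sketch of this behaviour, with sparkSession in scope as in the snippets below; the case classes and the "nickname" field are hypothetical, not from the ticket:

import com.mongodb.spark._

// "nickname" is required in one case class and optional in the other.
case class StrictProfile(name: String, nickname: String)
case class LenientProfile(name: String, nickname: Option[String])

val rdd = MongoSpark.load(sparkSession.sparkContext)
// Documents with no "nickname" field are silently dropped here...
val strictCount = rdd.toDS[StrictProfile]().count()
// ...but kept here, decoded with nickname = None.
val lenientCount = rdd.toDS[LenientProfile]().count()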

---------------
was:

MongoRDD gets the right count, but the Dataset does not

When I read a collection from Mongo into a Dataset, some elements seem to be missing.

At first, I found this issue: https://jira.mongodb.org/browse/SPARK-154?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.1

So I switched from 2.2.0 to 2.2.1, but that didn't fix the problem.

import com.mongodb.spark._

// MongoSpark.load(sc) returns a MongoRDD[Document], despite the "Ds" name.
val profileDs = MongoSpark.load(sparkSession.sparkContext)
val rddCount = profileDs.count()
val dsCount = profileDs.toDS[LightProfile]().count()

rddCount returns the correct value (1,647,864), but dsCount does not (901,028).

LightProfile is a Scala case class. When I delete some fields from the class, dsCount goes up, closer to the real value, until, once I have removed enough fields, it returns the right count.

I tried all the available partitioners, including the MongoPaginateBySizePartitioner with varying partition sizes, but none of them changed the results.
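
One way to pinpoint which fields are to blame (a diagnostic sketch, not from the original report; the field names are hypothetical) is to count, for each field the case class requires, the documents where that field actually exists:

import org.bson.Document
import com.mongodb.spark._

val rdd = MongoSpark.load(sparkSession.sparkContext)
val total = rdd.count()
// Any field whose count falls short of `total` must become Option[T]
// in the case class, or its documents will be dropped by toDS.
Seq("locale", "age", "email").foreach { field =>
  val present = rdd.withPipeline(
    Seq(Document.parse(s"""{ $$match: { "$field": { $$exists: true } } }"""))
  ).count()
  println(s"$field: $present of $total documents")
}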



 Comments   
Comment by Maxime [ 11/Apr/18 ]

Yes, this is it! Thank you. I changed some fields' types from T to Option[T], and then I got all my data.

I re-read the mongo-spark docs, and I don't think this behaviour is obvious; I would have expected the field to be null if the data did not exist. Is this expected from Spark in general, and not only from Mongo-Spark?

Comment by Ross Lawley [ 09/Apr/18 ]

Hi Maillot,

Thanks for the ticket - could you please include the case class LightProfile?

When I delete some fields from the class, dsCount goes up, closer to the real value, until, once I have removed enough fields, it returns the right count.

It sounds like you have non-optional fields in your case class, and the database has no data for those fields, so those documents aren't included in the count. A Dataset requires values, whereas an RDD is more akin to a Document (or Map) and can have variable keys and values. Does defining the optional fields as Option[T] fix the issue? (A sketch of this change follows the comment.)

Ross
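
A minimal sketch of the suggested fix, assuming a hypothetical shape for LightProfile (the actual case class was never posted; "locale" is the only field named in the ticket):

import com.mongodb.spark._

// Before: every field is required, so documents missing "age" are dropped.
// case class LightProfile(locale: String, age: Int)

// After: fields that may be absent become Option[T]; a missing "age"
// decodes as None instead of the whole document being skipped.
case class LightProfile(locale: String, age: Option[Int])

val fixedCount = MongoSpark.load(sparkSession.sparkContext).toDS[LightProfile]().count()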

Comment by Maxime [ 09/Apr/18 ]

import org.bson.Document

// The $match is pushed down to MongoDB for the first two counts;
// the last count filters only after the Dataset conversion.
val rddCount = profileDs.withPipeline(Seq(Document.parse("""{ $match: { locale: { $eq: "fr_FR" } } }"""))).count()
val rddToDsCount = profileDs.withPipeline(Seq(Document.parse("""{ $match: { locale: { $eq: "fr_FR" } } }"""))).toDS[LightProfile]().count()
val dsCount = profileDs.toDS[LightProfile]().filter(profile => profile.locale == "fr_FR").count()

With the code above, rddCount has the right value; rddToDsCount and dsCount share the same incorrect value.
