[DOCS-12255] [Spark] Document automatic collection filtering for Datasets Created: 06/Apr/18 Updated: 27/Oct/23 Resolved: 20/May/21 |
|
| Status: | Closed |
| Project: | Documentation |
| Component/s: | Spark Connector |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Maxime | Assignee: | Unassigned |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Spark 2.2.0 |
||
| Issue Links: |
|
||||||||
| Participants: | |||||||||
| Days since reply: | 5 years, 44 weeks, 1 day ago | ||||||||
| Epic Link: | DOCSP-6205 | ||||||||
| Description |
|
As using null in Scala is not idiomatic, Datasets automatically filter the documents in the database based on the fields of the Dataset's case class. --------------- MongoRDD gets the right count, but Dataset does not. When I read a collection from Mongo into a Dataset, some elements seem to be missing. At first, I found this issue: https://jira.mongodb.org/browse/SPARK-154?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.1 So I switched from 2.2.0 to 2.2.1, but that didn't fix the problem.
rddCount returns the correct value (1,647,864), but dsCount does not (901,028). LightProfile is a Scala case class. When I delete fields from the class, dsCount goes up, closer to the real value, until, once I have removed enough fields, it matches the right count. I tried all the available partitioners, including MongoPaginateBySizePartitioner with varying partition sizes, but none of them changed the results. |
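The count discrepancy can be modeled in plain Scala (a hypothetical stand-in, not the connector's actual implementation; `LightProfile` here is a simplified two-field version): strict decoding into a case class with non-optional fields drops any document that lacks one of those fields, so the Dataset-style count falls below the raw document count.

```scala
// Simplified stand-in for the user's case class: both fields are required.
case class LightProfile(id: Int, name: String)

// Strict decoding: a document missing any required field yields no row.
def decodeStrict(doc: Map[String, Any]): Option[LightProfile] =
  for {
    id   <- doc.get("id").collect { case i: Int => i }
    name <- doc.get("name").collect { case s: String => s }
  } yield LightProfile(id, name)

// Documents as maps, mimicking BSON's variable keys; one lacks "name".
val docs = Seq(
  Map[String, Any]("id" -> 1, "name" -> "a"),
  Map[String, Any]("id" -> 2), // "name" absent in the database
  Map[String, Any]("id" -> 3, "name" -> "c")
)

val rddCount = docs.size                       // raw documents: 3
val dsCount  = docs.flatMap(decodeStrict).size // doc 2 is dropped: 2
```

The gap between `rddCount` and `dsCount` grows with each required field that is sometimes absent, which is why removing fields from the case class moved the count toward the true value.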
| Comments |
| Comment by Maxime [ 11/Apr/18 ] | |||
|
Yes, this is it! Thank you. I changed the type of some fields from T to Option[T] and then I got all my data. I re-read the mongo-spark doc, and I don't think this behaviour is obvious; I would have expected the field to be null if the data did not exist. Is this expected more generally from Spark, and not only from Mongo-Spark? | |||
| Comment by Ross Lawley [ 09/Apr/18 ] | |||
|
Hi Maillot, Thanks for the ticket - could you please include the case class LightProfile?
It sounds like you have non-optional fields in your case class, and in the database there is no data for those fields - so those documents aren't included in the count. A Dataset requires values, whereas an RDD is more akin to a Document (or Map) and can have variable keys and values. Does defining the optional fields as Option[T] fix the issue? Ross | |||
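A minimal sketch of the suggested fix, again modeling documents as plain Scala maps (hypothetical code, not the connector's API): declaring the possibly-missing field as Option[T] lets every document decode, so the counts agree.

```scala
// The possibly-absent field is now Option[String] instead of String.
case class LightProfile(id: Int, name: Option[String])

// Tolerant decoding: a missing "name" becomes None rather than dropping the row.
def decode(doc: Map[String, Any]): Option[LightProfile] =
  doc.get("id").collect { case i: Int => i }.map { id =>
    LightProfile(id, doc.get("name").collect { case s: String => s })
  }

val docs = Seq(
  Map[String, Any]("id" -> 1, "name" -> "a"),
  Map[String, Any]("id" -> 2), // "name" absent
  Map[String, Any]("id" -> 3, "name" -> "c")
)

val dsCount = docs.flatMap(decode).size // all 3 documents now decode
```

With `Option[T]`, a document without the field still produces a row (with `None`), which matches the reporter's observation that switching field types from T to Option[T] recovered the full count.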
| Comment by Maxime [ 09/Apr/18 ] | |||
With the same code as above, rddCount has the right value; rddToDsCount and dsCount share the same incorrect value. |