Documentation / DOCS-12255

[Spark] Document automatic collection filtering for Datasets

    • Type: Task
    • Resolution: Gone away
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: Spark Connector
    • Labels: None
    • Environment: Spark 2.2.0

      Because using null is not idiomatic in Scala, Datasets automatically filter documents in the database based on the fields of the Dataset's case class: documents missing a value for a non-optional field are excluded.
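
      A minimal sketch of the behavior to document (the collection URI and the Strict/Lenient case classes below are hypothetical, not from the ticket): loading the same collection into a Dataset of a case class with non-Option fields drops every document that lacks one of those fields, while declaring them as Option keeps all documents.

      import com.mongodb.spark.MongoSpark
      import org.apache.spark.sql.SparkSession

      // Hypothetical collection: some documents have an "age" field, some do not.
      case class Strict(name: String, age: Int)          // "age" must exist and be non-null
      case class Lenient(name: String, age: Option[Int]) // missing or null "age" is fine

      val spark = SparkSession.builder()
        .master("local")
        .config("spark.mongodb.input.uri", "mongodb://localhost/test.people")
        .getOrCreate()

      // Documents without "age" are filtered out: a non-Option field becomes
      // a not-null condition applied when the collection is read.
      val strictCount = MongoSpark.load(spark.sparkContext).toDS[Strict]().count()

      // Option[Int] lifts the condition on "age", so every document is counted.
      val lenientCount = MongoSpark.load(spark.sparkContext).toDS[Lenient]().count()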

      ---------------
      was:

      MongoRDD gets the right count, but Dataset does not

      When I read a collection from Mongo into a Dataset, some elements seem to be missing.

      At first, I found this issue: https://jira.mongodb.org/browse/SPARK-154?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.1

      So I switched from 2.2.0 to 2.2.1, but that didn't fix the problem.

      import com.mongodb.spark.MongoSpark

      val profileRdd = MongoSpark.load(sparkSession.sparkContext) // MongoRDD[Document]
      val rddCount = profileRdd.count()
      val dsCount = profileRdd.toDS[LightProfile]().count()       // typed Dataset count
      

      rddCount returns the correct value (1,647,864), but dsCount does not (901,028).

      LightProfile is a Scala case class. When I delete fields from the class, dsCount climbs closer to the real value until, once enough fields are removed, it returns the right count.

      I tried all the available partitioners, including MongoPaginateBySizePartitioner with varying partition sizes, but none of them changed the results.
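
      For debugging which fields cause the gap, a sketch (it assumes the same sparkSession and connector configuration as above): load the collection as a DataFrame with an inferred schema and count the missing values per column; any column with missing values will shrink dsCount for as long as it remains a non-Option field of LightProfile.

      import com.mongodb.spark.MongoSpark
      import org.apache.spark.sql.functions.col

      // Load with an inferred schema: documents missing a field get null there.
      val df = MongoSpark.load(sparkSession)
      // Count, per top-level column, how many documents lack that field.
      df.columns.foreach { c =>
        println(s"$c: ${df.filter(col(c).isNull).count()} missing")
      }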

            Assignee:
            Unassigned
            Reporter:
            Maillot Maxime
            Votes:
            0
            Watchers:
            3

              Resolved:
              6 years, 2 weeks, 6 days ago