  Spark Connector / SPARK-149

Caching after MongoSpark.load yields false results

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker - P1
    • Fix Version/s: 2.1.1, 2.2.1
    • Affects Version/s: 2.2.0
    • Component/s: None
    • Environment:
      MongoDB 3.2.5, Spark 2.2.0, Scala 2.11.8

    • Description:

      Hi!

      I have experienced some weird behavior (like missing items) when caching in Spark after loading data from MongoDB. I have tried to reproduce the problem with a minimal piece of code:

      import com.mongodb.spark.MongoSpark

      case class Item(name: String, value: Double)

      // Loads a collection with 10,000,000 items of the form { name: String, value: Number }
      val imported = MongoSpark.load(sc, readConfig).toDS[Item]

      val result = imported.cache()

      println(result.filter($"value" === 800).count())
      // Prints 0 (it should print 1)

      println(result.unpersist().filter($"value" === 800).count())
      // Prints 1
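
      For reference, sc and readConfig above come from a setup along these lines (a minimal sketch; the app name, host, database and collection are placeholders, not the real deployment):

      import org.apache.spark.sql.SparkSession
      import com.mongodb.spark.config.ReadConfig

      val spark = SparkSession.builder()
        .appName("mongo-cache-repro")   // placeholder app name
        .getOrCreate()
      import spark.implicits._          // provides the $"value" column syntax

      val sc = spark.sparkContext

      // Placeholder connection string; the collection used for the test held the
      // 10,000,000 { name: String, value: Number } documents mentioned above.
      val readConfig = ReadConfig(Map(
        "uri" -> "mongodb://localhost:27017/test.items"
      ))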
      
      

      I have tried the same thing loading the data from a file (the same items exported using mongoexport) and it works fine; it always prints 1. That's why I think the problem is related to MongoSpark and not to Spark itself.
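
      The file-based check looked roughly like this (a minimal sketch; the path and the assumption that the export is one plain JSON document per line are placeholders, not the original code):

      // Read the mongoexport output (one JSON document per line) into the same Item type.
      val fromFile = spark.read.json("/data/items.json").as[Item]

      val cachedFromFile = fromFile.cache()

      println(cachedFromFile.filter($"value" === 800).count())
      // Prints 1, even with caching, which points at the connector rather than Spark.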

      Thanks,
      Alejandro.

            Assignee:
            Ross Lawley (ross@mongodb.com)
            Reporter:
            Trujillo Alejandro
            Votes:
            0
            Watchers:
            3
