Hi!
I have experienced some weird behavior (missing items) when caching in Spark after loading data from MongoDB. I have tried to reproduce the problem with minimal code:
case class Item(name: String, value: Double)

// Loads a collection with 10,000,000 items of the form { name: String, value: Number }
val imported = MongoSpark.load(sc, readConfig).toDS[Item]
val result = imported.cache

println(result.filter($"value" === 800).count())           // Prints 0 (should be 1)
println(result.unpersist.filter($"value" === 800).count()) // Prints 1
I have tried the same thing loading the data from a file (the same items exported using mongoexport) and it works fine: it always prints 1. That's why I think the problem is related to MongoSpark rather than Spark itself.
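For reference, this is roughly what the file-based comparison looked like (a minimal sketch: the file path, SparkSession setup, and JSON schema inference details are illustrative, not the exact code I ran):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Same items exported with mongoexport to a JSON-lines file (path is hypothetical)
val fromFile = spark.read.json("/path/to/items.json").as[Item]
val cached = fromFile.cache

println(cached.filter($"value" === 800).count()) // Prints 1 after caching, as expected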
Thanks,
Alejandro.
Duplicates: SPARK-151 "Issues with the SamplePartitioner" (Closed)