Spark Connector / SPARK-149

Caching after MongoSpark.load yields incorrect results

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker - P1
    • Fix Version/s: 2.1.1, 2.2.1
    • Affects Version/s: 2.2.0
    • Component/s: None
    • Labels:
    • Environment:
      MongoDB 3.2.5, Spark 2.2.0, Scala 2.11.8

      Hi!

      I have experienced some weird behavior (such as missing items) when caching in Spark after loading data from MongoDB. I have tried to reproduce the problem with minimal code:

      case class Item(name: String, value: Double)

      // Loads a collection with 10,000,000 documents of the form { name: String, value: Number }
      val imported = MongoSpark.load(sc, readConfig).toDS[Item]

      val result = imported.cache()

      println(result.filter($"value" === 800).count())
      // Prints 0 (should be 1)

      println(result.unpersist().filter($"value" === 800).count())
      // Prints 1
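
      For reference, here is a minimal sketch of the setup the snippet above assumes (the SparkSession, the ReadConfig, and the implicits needed for the $"value" syntax); the connection URI, database, and collection names below are placeholders, not the real ones:

      import org.apache.spark.sql.SparkSession
      import com.mongodb.spark.MongoSpark
      import com.mongodb.spark.config.ReadConfig

      val spark = SparkSession.builder()
        .appName("SPARK-149-repro")
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test.items") // placeholder URI
        .getOrCreate()
      val sc = spark.sparkContext

      import spark.implicits._ // required for the $"value" column syntax and the Item encoder

      // Placeholder database and collection names
      val readConfig = ReadConfig(Map("database" -> "test", "collection" -> "items"), Some(ReadConfig(sc)))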
      
      

      I have tried the same thing loading the data from a file (the same items exported using mongoexport) and it works fine: it always prints 1. That is why I think the problem is related to MongoSpark rather than Spark itself.
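
      A sketch of that file-based check, assuming the collection was exported with mongoexport to newline-delimited JSON at a hypothetical path /tmp/items.json:

      val fromFile = spark.read.json("/tmp/items.json")
        .select($"name", $"value".cast("double").as("value"))
        .as[Item]

      val cachedFromFile = fromFile.cache()
      println(cachedFromFile.filter($"value" === 800).count())
      // Prints 1, as expected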

      Thanks,
      Alejandro.

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            Trujillo Alejandro [X]
            Votes:
            0
            Watchers:
            3
