Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Affects Version/s: 2.3.1
Component/s: Partitioners
When loading data with the MongoDB Spark connector from a collection whose partition key holds mismatched BSON types, data is silently lost (with any partitioner).
Steps to reproduce the problem:
Fill the collection with half ObjectId, half string values for _id:
```javascript
for (let i = 0; i < 30000; ++i) {
  let doc = {};
  if (i % 2 === 0) doc._id = 'id_' + i.toString(); // even: string _id
  doc.index = i;                                   // odd: auto-generated ObjectId _id
  db.foo.insert(doc);
}
```
Retrieve the data using MongoSamplePartitioner, MongoSplitVectorPartitioner, or MongoPaginateByCountPartitioner with the default _id partition key:
```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

case class MyRecord(index: Double)

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sample")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    // Swap the partitioner name below to reproduce with each partitioner.
    val ds = MongoSpark.load[MyRecord](spark, ReadConfig(
      Map(
        "collection" -> "foo",
        "uri" -> "mongodb://localhost/mydb",
        "partitioner" -> "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB" -> "1"))
    ).as[MyRecord]

    println("total: " + ds.count())
    println("evens (typeof _id = string): " + ds.filter(_.index % 2 == 0).count())
    println("odds (typeof _id = ObjectId): " + ds.filter(_.index % 2 == 1).count())
  }
}
```
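One way to surface the silent loss is to compare the dataset count against a count taken directly through the MongoDB Java driver, bypassing the partitioners entirely. This is a minimal sketch of that check (not part of the original report), assuming the same local mydb.foo collection and the synchronous Java driver that ships with the connector:

```scala
import com.mongodb.MongoClient

object CountCheck {
  def main(args: Array[String]): Unit = {
    // Count documents directly through the driver, with no partitioning involved.
    val client = new MongoClient("localhost")
    val driverCount = client.getDatabase("mydb").getCollection("foo").count()
    client.close()

    // With the fill script above this prints 30000, while ds.count() in the
    // Spark job reports fewer documents, which makes the loss visible.
    println("driver count: " + driverCount)
  }
}
```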
These are the results I get locally (although I suspect they may vary):
| Partitioner | Even records (typeof _id = string) | Odd records (typeof _id = ObjectId) |
|---|---|---|
| MongoSamplePartitioner | 15000 | 0 |
| MongoSplitVectorPartitioner | 14169 | 1661 |
| MongoPaginateByCountPartitioner | 14977 | 0 |
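My guess at the mechanism (an assumption on my part, not taken from the partitioner source) is that the partitioners turn their computed boundaries into _id range filters, and MongoDB range queries are type-bracketed: a range between two values of one BSON type never matches values of another type. The sketch below demonstrates that bracketing directly with the Java driver; the ObjectId bounds are made up for illustration:

```scala
import com.mongodb.MongoClient
import org.bson.Document
import org.bson.types.ObjectId

object RangeBracketing {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient("localhost")
    val coll = client.getDatabase("mydb").getCollection("foo")

    // A range filter of the shape a partitioner might generate, with
    // hypothetical ObjectId bounds spanning the whole ObjectId space.
    val filter = new Document("_id",
      new Document("$gte", new ObjectId("000000000000000000000000"))
        .append("$lt", new ObjectId("ffffffffffffffffffffffff")))

    // Matches only the 15000 ObjectId _ids; the 15000 string _ids are a
    // different BSON type and are never returned by this range query.
    println("in ObjectId range: " + coll.count(filter))
    client.close()
  }
}
```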
Is related to: SPARK-256 "count sharding collection document number less than aggregate by mongo shell" (Closed)