Uploaded image for project: 'Spark Connector'
  1. Spark Connector
  2. SPARK-221

Partitioners silently lose data when partitionKey has unmatching types

    • Type: Icon: Bug Bug
    • Resolution: Works as Designed
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.3.1
    • Component/s: Partitioners
    • None

      When loading data using the Mongospark connector for a collection which has unmatching types for the partition key, data is lost silently (using any partitioner).

      Steps to reproduce the problem:

      Fill the collection with half ObjectId, half string values for _id:

       

      for (let i=0; i<30000; ++i) {
          let doc = {}
          if (i % 2 === 0) doc._id = 'id_' + i.toString()
          doc.index = i
          db.foo.insert(doc)
      }
      

      Retrieve the data using MongoSamplePartitioner, MongoSplitVectorPartitioner or MongoSamplePartitioner and the default _id partition key:

       

       

      import com.mongodb.spark.MongoSpark
      import com.mongodb.spark.config.ReadConfig
      import org.apache.spark.sql.SparkSession
      
      case class MyRecord(
        index: Double
      )
      
      object Main {
        def main(args: Array[String]) = {
          val spark = SparkSession.builder()
            .appName("sample")
            .master("local[2]")
            .getOrCreate()
      
          import spark.implicits._
      
          val ds = MongoSpark.load[MyRecord](spark, ReadConfig(
            Map(
              "collection" -> "foo",
              "uri" -> "mongodb://localhost/mydb",
              "partitioner" -> "MongoSamplePartitioner",
              "partitionerOptions.partitionSizeMB" -> "1")
            )
          ).as[MyRecord]
      
          println("total: " + ds.count())
          println("evens (typeof _id = string): " + ds.filter(_.index % 2 == 0).count())
          println("odds (typeof _id = ObjectId): " + ds.filter(_.index % 2 == 1).count())
        }
      }
      

      This is the results I get locally (although I suspect they may vary):

       

       

      Partitioner Even records
      (typeof _id = string)
      Odd records
      (typeof _id == ObjectId)
      MongoSamplePartitioner 15000 0
      MongoSplitVectorPartitioner 14169 1661
      MongoPaginateByCountPartitioner 14977 0

       

       

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            sacha.viscaino@gadz.org Sacha Viscaino
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: