Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.3.1
Component/s: Partitioners
Labels: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Link: None
Goal Name(s): None

When loading a collection with the MongoDB Spark Connector, documents are silently dropped if the partition key holds values of mixed types. This happens with every partitioner tested.

Steps to reproduce the problem:

Fill the collection so that half of the documents have a string _id and the other half an auto-generated ObjectId _id:

for (let i=0; i<30000; ++i) {
    let doc = {}
    // Even indexes get a string _id; odd indexes keep the auto-generated ObjectId.
    if (i % 2 === 0) doc._id = 'id_' + i.toString()
    doc.index = i
    db.foo.insert(doc)
}
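As a baseline for the counts reported further down, the following minimal sketch tallies the _id types the loop above should produce. It runs without a database, and the function name expectedIdTypeCounts is hypothetical, introduced only for illustration.

```javascript
// Tally the _id types the insert loop is expected to produce.
// No database is involved: an odd index stands for a document whose
// _id is left unset and therefore receives an auto-generated ObjectId.
function expectedIdTypeCounts(total) {
  const counts = { string: 0, objectId: 0 };
  for (let i = 0; i < total; ++i) {
    if (i % 2 === 0) {
      counts.string += 1;   // explicit 'id_' + i string _id
    } else {
      counts.objectId += 1; // auto-generated ObjectId _id
    }
  }
  return counts;
}

const expectedCounts = expectedIdTypeCounts(30000);
console.log(expectedCounts);
```

A complete load should therefore return 15000 documents of each type, 30000 in total.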

Retrieve the data using MongoSamplePartitioner, MongoSplitVectorPartitioner or MongoPaginateByCountPartitioner with the default _id partition key:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

case class MyRecord(
  index: Double
)

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sample")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    val ds = MongoSpark.load[MyRecord](spark, ReadConfig(
      Map(
        "collection" -> "foo",
        "uri" -> "mongodb://localhost/mydb",
        "partitioner" -> "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB" -> "1")
      )
    ).as[MyRecord]

    println("total: " + ds.count())
    println("evens (typeof _id = string): " + ds.filter(_.index % 2 == 0).count())
    println("odds (typeof _id = ObjectId): " + ds.filter(_.index % 2 == 1).count())
  }
}

These are the results I get locally (although I suspect they may vary between runs):

Partitioner	Even records (typeof _id = string)	Odd records (typeof _id = ObjectId)
MongoSamplePartitioner	15000	0
MongoSplitVectorPartitioner	14169	1661
MongoPaginateByCountPartitioner	14977	0
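To make the size of the loss explicit, the sketch below compares each row of the table with the 30000 documents inserted in the first step. It assumes the loaded total equals the sum of the even and odd counts, since the output of ds.count() is not listed above. The variable names are illustrative.

```javascript
// Documents inserted by the reproduction script.
const inserted = 30000;

// Counts copied from the results table above.
const reported = [
  { partitioner: "MongoSamplePartitioner", evens: 15000, odds: 0 },
  { partitioner: "MongoSplitVectorPartitioner", evens: 14169, odds: 1661 },
  { partitioner: "MongoPaginateByCountPartitioner", evens: 14977, odds: 0 },
];

// Assumption: loaded = evens + odds, because the table has no total column.
const shortfall = reported.map(({ partitioner, evens, odds }) => {
  const loaded = evens + odds;
  return { partitioner, loaded, missing: inserted - loaded };
});

for (const row of shortfall) {
  console.log(`${row.partitioner}: loaded ${row.loaded}, missing ${row.missing}`);
}
```

Under that assumption, every partitioner returns roughly half of the collection, and two of them return no ObjectId-keyed documents at all.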

Is related to: SPARK-256 count sharding collection document number less than aggregate by mongo shell (Closed)

Assignee: Ross Lawley
Reporter: Sacha Viscaino
Reviewers: None
Votes: 0
Watchers: 1

Created: Oct 23 2018 09:36:53 AM UTC
Updated: Oct 27 2023 11:54:01 AM UTC
Resolved: Oct 23 2018 10:58:47 AM UTC