  1. Spark Connector
  2. SPARK-365

Add schemaHints option for inferring schema

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: 10.0.4
    • Component/s: None
    • Labels:
      None

      Background

      I have a MongoDB collection with 8M documents and a lot of fields. In half of the documents the metadata field is a string, and in the rest it is an object.

      What is my issue

      When I dump the data from MongoDB to Databricks using the MongoDB Spark Connector, the job sometimes succeeds and sometimes fails with:

      com.mongodb.spark.sql.connector.exceptions.DataException: Invalid field: 'uriMetadata'. The dataType 'struct' is invalid for 'BsonString{value='xxxxxxx'}'.
      

      I think the failure happens when the connector infers the schema from a sample in which every document has an object-typed metadata value. The metadata column then becomes a struct column in Databricks, and the job fails because string data cannot be inserted into a struct column.
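A toy model of what I suspect is happening, written in plain Python. This is not the connector's inference code: the helper names (`kind_of`, `infer_field_type`, `convert`) and the rule that mixed types fall back to string are assumptions made only to illustrate how a sample without any string value could lead to the error above.

```python
def kind_of(value):
    """Classify a value the way this toy model sees it."""
    return "struct" if isinstance(value, dict) else "string"


def infer_field_type(sample, field):
    """Infer a field's type from the sampled documents only."""
    kinds = {kind_of(doc[field]) for doc in sample if field in doc}
    # Toy rule (an assumption): one kind seen -> use it; mixed kinds -> string.
    return kinds.pop() if len(kinds) == 1 else "string"


def convert(doc, field, inferred):
    """Fail when a string value meets a field that was inferred as struct."""
    value = doc[field]
    if inferred == "struct" and kind_of(value) != "struct":
        raise ValueError(
            f"Invalid field: '{field}'. The dataType 'struct' is invalid for {value!r}."
        )
    return value


docs = [
    {"_id": 1, "metadata": {"source": "a"}},
    {"_id": 2, "metadata": "plain text"},
    {"_id": 3, "metadata": {"source": "b"}},
]

# A sample that happens to contain only object-typed metadata values.
unlucky_sample = [docs[0], docs[2]]
inferred_unlucky = infer_field_type(unlucky_sample, "metadata")

# A sample that contains both kinds of value.
inferred_mixed = infer_field_type(docs, "metadata")
```

With the unlucky sample, `convert(docs[1], "metadata", inferred_unlucky)` raises, which mirrors the intermittent failure I see: whether the job succeeds depends on which documents end up in the sample.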

      What do I want

      I would like the MongoDB Spark Connector to support something similar to schemaHints, so that I can provide a schema hint for the metadata column only, suggesting that it be a string column.

      What have I considered

      1. Increase sampleSize
        I know I can raise sampleSize to increase the chance that the connector infers the schema from a sample that contains string metadata values. However, this still does not guarantee that a string metadata value will be included in the sample, so metadata can still be inferred as a struct column.
      2. Provide a full schema with .schema(my_schema)
        My collection has many fields and a complicated nested schema, and we may introduce new fields to the collection from time to time. It is difficult for me to define and maintain a full schema for the collection, so I would like to define the schema for only some fields.
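A possible stopgap between the two options above, sketched under assumptions: let the connector infer the schema once, override only the hinted top-level fields, then read again with the patched schema. `apply_schema_hints` and `read_with_hints` are hypothetical helpers, not connector APIs. The sketch assumes the connection URI, database and collection are already set in the Spark session configuration, and that the connector can convert object values to strings when the explicit schema declares a string column, which should be verified against the connector version in use.

```python
import copy


def apply_schema_hints(schema_json, hints):
    """Return a copy of a schema in StructType.jsonValue() form with the
    type of each hinted top-level field replaced."""
    patched = copy.deepcopy(schema_json)
    for field in patched["fields"]:
        if field["name"] in hints:
            field["type"] = hints[field["name"]]
    return patched


def read_with_hints(spark, hints):
    """Hypothetical helper: infer, patch the hinted fields, re-read."""
    # Imported lazily so apply_schema_hints can be used without Spark installed.
    from pyspark.sql.types import StructType

    inferred = spark.read.format("mongodb").load().schema
    patched = StructType.fromJson(apply_schema_hints(inferred.jsonValue(), hints))
    return spark.read.format("mongodb").schema(patched).load()


# Example on a hand-written schema dict; no Spark session is needed here.
# Note: the inner "metadata" key is Spark's per-field metadata, which only
# coincidentally shares its name with the collection's metadata field.
inferred_json = {
    "type": "struct",
    "fields": [
        {"name": "_id", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "metadata",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "source", "type": "string", "nullable": True, "metadata": {}}
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}
patched_json = apply_schema_hints(inferred_json, {"metadata": "string"})
```

In a notebook this would be called as `df = read_with_hints(spark, {"metadata": "string"})`. It costs an extra inference pass and only handles top-level fields, which is why a built-in schemaHints option would still be preferable.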

            Assignee:
            Unassigned
            Reporter:
            me@kytse.com Kit Yam Tse
            Votes:
            0
            Watchers:
            4
