- Type: Improvement
- Resolution: Fixed
- Priority: Unknown
- Affects Version/s: None
- Component/s: None
What did I do
I use the MongoDB Spark connector to dump data from MongoDB to Databricks.
I have two records in MongoDB:
| properties |
| --- |
| [{kind: 234}, {value: "orange"}, {_id: "abc"}] |
| [] |
The schema of this column is inferred as an array of StringType.
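For context, a minimal sketch of the read that produces this inference (option names assume connector 10.x; the URI, database, and collection names are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the collection through the MongoDB Spark connector and let it infer
# the schema from the sampled documents.
df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://host:27017")
    .option("database", "mydb")
    .option("collection", "mycoll")
    .load()
)

# With the two documents above, `properties` comes back as array<string>.
df.printSchema()
```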
What do I want
The schema of this column should be inferred as an array of StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true)).
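For clarity, the same expected type written out with PySpark types (a sketch matching the type string above; the variable name is just for illustration):

```python
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType,
)

# The type I expect the connector to infer for the `properties` column.
expected_properties_type = ArrayType(
    StructType([
        StructField("kind", IntegerType(), True),
        StructField("value", StringType(), True),
        StructField("_id", StringType(), True),
    ])
)
```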
Why do I need it
I need to dump data from MongoDB to a Databricks table batch by batch.
Currently the column is inferred as an array of strings in one batch but as an array of structs in another batch. As a result, I receive an error when I try to merge the two batches:
AnalysisException: Failed to merge fields 'xxx' and 'xxx'. Failed to merge incompatible data types StringType and StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true))
I want a consistent schema across batches.
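To illustrate where this surfaces, a hedged sketch of the batch flow (the table name, the per-batch aggregation pipelines, and the append-to-table write are assumptions for illustration, not my exact job):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder batching criteria; my real job slices the collection differently.
pipeline_for_batch_1 = '[{"$match": {"batch": 1}}]'
pipeline_for_batch_2 = '[{"$match": {"batch": 2}}]'

def load_batch(pipeline):
    # Each batch reads a slice of the collection; the schema is inferred
    # independently for every batch.
    return (
        spark.read.format("mongodb")
        .option("connection.uri", "mongodb://host:27017")
        .option("database", "mydb")
        .option("collection", "mycoll")
        .option("aggregation.pipeline", pipeline)
        .load()
    )

# A batch whose `properties` arrays are all empty is inferred as array<string>;
# a batch with populated arrays is inferred as array<struct<...>>. Appending the
# second batch to the same table then fails with the AnalysisException above.
load_batch(pipeline_for_batch_1).write.mode("append").saveAsTable("target_table")
load_batch(pipeline_for_batch_2).write.mode("append").saveAsTable("target_table")
```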
Having https://jira.mongodb.org/projects/SPARK/issues/SPARK-365 may help resolve this issue.