- Type: Improvement
- Resolution: Fixed
- Priority: Unknown
- Affects Version/s: None
- Component/s: None
What did I do
I use the MongoDB Spark connector to dump data from MongoDB to Databricks.
I have two records in MongoDB:
| properties |
| --- |
| [{kind: 234}, {value: "orange"}, {_id: "abc"}] |
| [] |
The schema of this column is inferred as an array of StringType.
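For context, a minimal sketch of the read that produces this inference (option names assume connector 10.x; the URI, database, and collection names are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the collection through the MongoDB Spark connector and let it infer
# the schema from the sampled documents.
df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://host:27017")
    .option("database", "mydb")
    .option("collection", "mycoll")
    .load()
)

# With the two documents above, `properties` comes back as array<string>.
df.printSchema()
```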
What do I want
The schema of this column should be inferred as an array of StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true)).
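For clarity, the same expected type written out with PySpark types (a sketch matching the type string above; the variable name is just for illustration):

```python
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType,
)

# The type I expect the connector to infer for the `properties` column.
expected_properties_type = ArrayType(
    StructType([
        StructField("kind", IntegerType(), True),
        StructField("value", StringType(), True),
        StructField("_id", StringType(), True),
    ])
)
```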
Why do I need it
I need to dump data from MongoDB to a Databricks table batch by batch.
Currently the column is inferred as an array of strings in one batch but as an array of structs in another batch. As a result, I receive an error when I try to merge the two batches:
AnalysisException: Failed to merge fields 'xxx' and 'xxx'. Failed to merge incompatible data types StringType and StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true))
I want a consistent schema across batches.
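To illustrate where this surfaces, a hedged sketch of the batch flow (the table name, the per-batch aggregation pipelines, and the append-to-table write are assumptions for illustration, not my exact job):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder batching criteria; my real job slices the collection differently.
pipeline_for_batch_1 = '[{"$match": {"batch": 1}}]'
pipeline_for_batch_2 = '[{"$match": {"batch": 2}}]'

def load_batch(pipeline):
    # Each batch reads a slice of the collection; the schema is inferred
    # independently for every batch.
    return (
        spark.read.format("mongodb")
        .option("connection.uri", "mongodb://host:27017")
        .option("database", "mydb")
        .option("collection", "mycoll")
        .option("aggregation.pipeline", pipeline)
        .load()
    )

# A batch whose `properties` arrays are all empty is inferred as array<string>;
# a batch with populated arrays is inferred as array<struct<...>>. Appending the
# second batch to the same table then fails with the AnalysisException above.
load_batch(pipeline_for_batch_1).write.mode("append").saveAsTable("target_table")
load_batch(pipeline_for_batch_2).write.mode("append").saveAsTable("target_table")
```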
Having https://jira.mongodb.org/projects/SPARK/issues/SPARK-365 may help resolve this issue.