[KAFKA-343] Improve schema inference for documents nested in arrays
Created: 03/Jan/23  Updated: 28/Oct/23  Resolved: 09/Jan/23

Status: Closed
Project: Kafka Connector
Component/s: Source
Affects Version/s: None
Fix Version/s: 1.9.0

Type: Improvement
Priority: Unknown
Reporter: Jeffrey Yemin
Assignee: Jeffrey Yemin
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
Related
related to KAFKA-349 Schema inference fails with an Array ... Closed
related to KAFKA-175 Inferring schema should support varia... Closed
is related to SPARK-375 failed to infer schema of array field... Closed
Documentation Changes: Needed

 Description   

Schema inference for documents nested in arrays falls back to "string" when any difference is detected between the schemas of the nested documents. This is necessary because Kafka schemas cannot handle arrays with elements of different types. However, we can improve the schema inference to detect some cases where the schemas for the nested documents are actually compatible:

  1. Where the field is present in one document but missing in another
  2. Where the field is present in one document but null in another
  3. Where the field types conflict (in this case we can push the conflict down to the schema for the field)
  4. Where the field is an array with elements of some type in one document but an empty array in another
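
The four cases above can be illustrated with a minimal sketch. It is not the connector's implementation: it models schemas as plain Python values rather than Kafka Connect schema objects, and the names `infer`, `merge`, `NULL`, `is_struct`, and `is_array` are hypothetical, chosen only for this example.

```python
# Illustrative model only, not the connector's code.
# A schema is one of:
#   - a primitive name such as "int64" or "string"
#   - NULL, for a field whose value was null in the source document
#   - {"type": "struct", "fields": {name: schema}}
#   - {"type": "array", "items": schema}, where items is None for an empty array

NULL = "null"


def is_struct(schema):
    return isinstance(schema, dict) and schema.get("type") == "struct"


def is_array(schema):
    return isinstance(schema, dict) and schema.get("type") == "array"


def merge(a, b):
    """Combine two schemas, falling back to "string" only where they truly conflict."""
    if a == b:
        return a
    # Case 2: a null value carries no type information, so the other side wins.
    if a == NULL:
        return b
    if b == NULL:
        return a
    if is_struct(a) and is_struct(b):
        fields = dict(a["fields"])
        for name, schema in b["fields"].items():
            # Case 1: a field missing on one side is simply added.
            # Case 3: a conflict is resolved per field, not for the whole struct.
            fields[name] = merge(fields[name], schema) if name in fields else schema
        return {"type": "struct", "fields": fields}
    if is_array(a) and is_array(b):
        # Case 4: an empty array adopts the element schema of the non-empty one.
        if a["items"] is None:
            return b
        if b["items"] is None:
            return a
        return {"type": "array", "items": merge(a["items"], b["items"])}
    # Irreconcilable types: only this position becomes "string".
    return "string"


def infer(value):
    """Infer a schema from a decoded document value."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # must be checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {"type": "struct", "fields": {k: infer(v) for k, v in value.items()}}
    if isinstance(value, list):
        items = None
        for element in value:
            element_schema = infer(element)
            items = element_schema if items is None else merge(items, element_schema)
        return {"type": "array", "items": items}
    raise TypeError(f"unsupported value: {value!r}")


documents = [
    {"a": 1, "b": None, "c": [], "d": 1},
    {"a": 2, "b": "x", "c": [3], "d": "text", "e": True},
]
combined = infer(documents)["items"]
```

In this sketch the two nested documents combine into a single struct schema: `e` is added, `b` takes the non-null type, `c` takes the element type of the non-empty array, and only `d` degrades to "string". A real implementation would presumably also need to mark a field that is absent from some documents as optional, which this sketch does not model.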


 Comments   
Comment by Githook User [ 09/Jan/23 ]

Author: Jeff Yemin <jeff.yemin@mongodb.com> (username: jyemin)

Message: Combine compatible schemas for array elements (#128)

1. Where two struct types differ only where one has a field that the other does not. In this case, the
extra field is added to the combined schema
2. Where two struct types differ only where one has a field that the other has whose value is null in
the source document. In this case, the extra field is added to the combined schema with the type of the
field that has the non-null value.
3. Where two struct types differ where corresponding fields in the struct have a type conflict. In this case, the type
conflict is pushed down to the field, and the type of that field is what becomes string.
4. Where two array types differ only in that one of the arrays is empty. In this case, the value schema for the
empty array is changed to the one for the non-empty array

KAFKA-343

Co-authored-by: Ross Lawley <ross.lawley@gmail.com>
Branch: master
https://github.com/mongodb/mongo-kafka/commit/5530385ee5432559be558279532c1c21157c91fa

Comment by Ross Lawley [ 03/Jan/23 ]

Linking a similar issue from Spark
