[KAFKA-175] Schema inference should support variable types for use with JSON with Schema. Created: 19/Nov/20  Updated: 27/Oct/23  Resolved: 03/Jan/23

Status: Closed
Project: Kafka Connector
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Robert Walters Assignee: Ross Lawley
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to KAFKA-343 Improve schema inference for document... Closed

 Description   

Schema inference falls back to the base type when determining the schema for arrays whose elements differ. So when sourcing the following document structure:

{
    "L1": {
      "L2": {
        "L3": [ {"V2": {"K1": 0},"K1": 0},  {"V5": ["A1", "A2"], "V11": 1} ]
      }
    }
  }

The inferred type of L3 is Array with a value type of Schema.STRING, so each element is emitted as a JSON string:

  "fullDocument": {
    "_id": "5fb67d988f8729ab566e4f6b",
    "L1": {
      "L2": {
        "L3": [ "{\"V2\": {\"K1\": 0}, \"K1\": 0}","{\"V5\": [\"A1\", \"A2\"], \"V11\": 1}" ]
      }
    }
  },
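
For reference, the inferred Connect schema for L3 here is equivalent to the following minimal sketch using the Kafka Connect SchemaBuilder API (the class name is only for illustration):

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;

public class InferredL3Schema {
    public static void main(String[] args) {
        // The two array elements have different shapes, so inference falls
        // back to the base type: an array of Schema.STRING_SCHEMA values.
        Schema l3Schema = SchemaBuilder.array(Schema.STRING_SCHEMA).build();
        System.out.println(l3Schema.type() + " of " + l3Schema.valueSchema().type()); // ARRAY of STRING
    }
}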

Configuration:

{
  "key.converter.schemas.enable": "false",
  "value.converter.schemas.enable": "false",
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "tasks.max": "1",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true",
  "connection.uri":"CONECTIONSTRING",
  "database": "testdb",
  "collection": "testcol",
  "topic.prefix": "test-prefix",
  "output.format.key": "json",
  "output.format.value": "schema",
  "output.schema.infer.value": "true",
  "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
  "copy.existing": "true"
}

JSON schemas do allow variable object types for Structs and Arrays (see Array Compatibility). So when output.schema.infer.value=true and the value is output as JSON with schema, there should be no fallback to a base type. Note this will require an extra configuration, e.g. "output.schema.infer.compatibility": [none|all], defaulting to all to keep the current behaviour.

For reference see:
https://developer.mongodb.com/community/forums/t/array-of-objects-become-array-of-string-during-upload-to-kafka/11509/3



 Comments   
Comment by Ross Lawley [ 03/Jan/23 ]

Marking as won't fix for the reasons provided.

KAFKA-343 looks to improve schema inference so that complex arrays can be better supported.

Comment by Ross Lawley [ 03/Jan/23 ]

Having reviewed the available APIs, Schema.Type#ARRAY is defined as:

An ordered sequence of elements, each of which shares the same type.

So there is no way in the SourceRecord API to natively support this.
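
For illustration, a minimal sketch against the Kafka Connect SchemaBuilder API (the struct shapes are taken from the example document above; the class name is only for illustration): SchemaBuilder.array accepts exactly one element schema, and there is no union type to combine the two shapes.

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;

public class HeterogeneousArrayLimitation {
    public static void main(String[] args) {
        // Shape of the first element: {"V2": {"K1": 0}, "K1": 0}
        Schema first = SchemaBuilder.struct()
                .field("V2", SchemaBuilder.struct().field("K1", Schema.INT32_SCHEMA).build())
                .field("K1", Schema.INT32_SCHEMA)
                .build();

        // Shape of the second element: {"V5": ["A1", "A2"], "V11": 1}
        Schema second = SchemaBuilder.struct()
                .field("V5", SchemaBuilder.array(Schema.STRING_SCHEMA).build())
                .field("V11", Schema.INT32_SCHEMA)
                .build();

        // SchemaBuilder.array takes a single element schema; picking either
        // struct schema excludes the other shape, and there is no union.
        Schema firstOnly = SchemaBuilder.array(first).build();
        Schema secondOnly = SchemaBuilder.array(second).build();
        System.out.println(firstOnly.valueSchema().fields());  // fields of the first shape only
        System.out.println(secondOnly.valueSchema().fields()); // fields of the second shape only
    }
}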

While "Json with schema should be able to support varying types for all data" is true, the connector has to produce SourceRecords which has its own schema restrictions. Converters (eg to Json / Json with schema) are applied once the SourceRecord using the schema'd information is produced.

So to handle multiple types of data, producing an Array of Json strings is the workaround for this limitation, as sketched below.
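
A hedged consumer-side sketch of that workaround, assuming Jackson is available (the field names come from the example document above; the class and method names are only for illustration):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.List;

public class ParseArrayOfJsonStrings {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Each element of L3 arrives as a JSON string; parsing per element lets
    // differently-shaped documents be handled individually.
    static void handle(List<String> l3) throws Exception {
        for (String element : l3) {
            JsonNode node = MAPPER.readTree(element);
            if (node.has("V2")) {
                System.out.println("First shape, V2.K1 = " + node.path("V2").path("K1").asInt());
            } else if (node.has("V5")) {
                System.out.println("Second shape, V5 = " + node.path("V5"));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        handle(List.of(
                "{\"V2\": {\"K1\": 0}, \"K1\": 0}",
                "{\"V5\": [\"A1\", \"A2\"], \"V11\": 1}"));
    }
}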

Comment by Ross Lawley [ 20/Nov/20 ]

I've reopened this, as Json with schema should be able to support varying types for all data. It's not obvious how to achieve that using the SchemaBuilder API.

Comment by Ross Lawley [ 19/Nov/20 ]

Hi robert.walters.

This is "works as designed". Arrays have to have fixed schemas for the value type. Here the array has two totally different schema'd documents and in that case the connector goes to the base type which is String.

Ross
