[KAFKA-366] Parallel bulk writes from sink connector Created: 27/Apr/23  Updated: 14/Aug/23

Status: Backlog
Project: Kafka Connector
Component/s: Sink
Affects Version/s: None
Fix Version/s: 1.12.0

Type: New Feature Priority: Unknown
Reporter: Martin Andersson Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Quarter: FY24Q3

 Description   

In com.mongodb.kafka.connect.sink.StartedMongoSinkTask#put, a collection of records is grouped into batches of writes by namespace (i.e. MongoDB database and collection name). However, these distinct batches are then written to MongoDB serially, one after another.

This means that you will get a large drop in performance if

  1. your sink connector consumes from multiple topics
    or
  2. you add transforms that split data from one topic into multiple collections

 

My team first noticed this issue during a data rate spike that caused the connector to lag behind by over an hour.

We should be able to do these bulk writes in parallel with a thread pool (with a configurable pool size). Since each batch is written to a separate collection, ordering within each collection will not be affected.
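
A minimal sketch of the idea, not the connector's actual code: the class ParallelBulkWriteSketch, the Write type, and the bulkWriter callback are hypothetical stand-ins (a real implementation would call the MongoDB driver's bulk write inside the callback), and the pool size would come from a new, not yet existing, configuration option.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

public class ParallelBulkWriteSketch {

    /** Hypothetical stand-in for a sink record already mapped to a namespace. */
    public static final class Write {
        final String namespace; // e.g. "database.collection"
        final String payload;

        public Write(String namespace, String payload) {
            this.namespace = namespace;
            this.payload = payload;
        }
    }

    /** Groups writes by namespace, preserving arrival order within each namespace. */
    public static Map<String, List<Write>> groupByNamespace(List<Write> writes) {
        Map<String, List<Write>> batches = new LinkedHashMap<>();
        for (Write w : writes) {
            batches.computeIfAbsent(w.namespace, k -> new ArrayList<>()).add(w);
        }
        return batches;
    }

    /**
     * Submits one bulk write per namespace to a fixed-size pool and blocks
     * until all of them have finished. A failure in any batch surfaces as an
     * ExecutionException from Future.get().
     */
    public static void writeInParallel(
            Map<String, List<Write>> batches,
            int poolSize, // assumed to come from a new configuration option
            BiConsumer<String, List<Write>> bulkWriter)
            throws InterruptedException, ExecutionException {
        // Created per call only to keep the sketch self-contained; a real task
        // would more likely create the pool once and shut it down on stop.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<String, List<Write>> e : batches.entrySet()) {
                pending.add(pool.submit(() -> bulkWriter.accept(e.getKey(), e.getValue())));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and propagate failures
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Each namespace is still handled by a single callback invocation, so writes to any one collection keep their original order. Waiting on every future before returning means the caller does not proceed until all batches have been attempted, as in the serial version.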



 Comments   
Comment by Martin Andersson [ 02/May/23 ]

robert.walters@mongodb.com I had a read, and the configuration options mentioned in that article do not address the issue described in this ticket. Within a task, if that task consumes from multiple topics (or consumed records are mapped to multiple MongoDB namespaces), then the max.batch.size configuration option does not apply. Naturally, records being written to different MongoDB collections are grouped into separate batches.

 

Comment by Robert Walters [ 01/May/23 ]

Hi martin.andersson@kambi.com, please review this blog post: https://www.mongodb.com/developer/products/connectors/tuning-mongodb-kafka-connector/ Under Sink there are some recommendations to improve sink write performance, specifically setting the tasks.max property.

Generated at Thu Feb 08 09:06:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.