[KAFKA-366] Parallel bulk writes from sink connector Created: 27/Apr/23 Updated: 14/Aug/23 |
|
| Status: | Backlog |
| Project: | Kafka Connector |
| Component/s: | Sink |
| Affects Version/s: | None |
| Fix Version/s: | 1.12.0 |
| Type: | New Feature | Priority: | Unknown |
| Reporter: | Martin Andersson | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Quarter: | FY24Q3 |
| Description |
|
In com.mongodb.kafka.connect.sink.StartedMongoSinkTask#put a collection of records is grouped into batches of writes by namespace (i.e. MongoDB database and collection name). However, this list of distinct batches is then written to MongoDB serially. This means that you will see a large drop in performance if a task consumes from multiple topics (or the consumed records are mapped to multiple MongoDB namespaces), because each batch must wait for the previous one to complete.
My team first noticed this issue during a data rate spike that caused the connector to lag behind by over an hour. We should be able to perform these bulk writes in parallel using a thread pool with a configurable pool size. Since each batch is written to a separate collection, ordering will not be impacted. |
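A minimal sketch of the idea, assuming the already-grouped batches are available as a map keyed by namespace (the class name, pool size, and map type are illustrative only, not the connector's actual internals):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

import org.bson.BsonDocument;

import com.mongodb.MongoNamespace;
import com.mongodb.client.MongoClient;
import com.mongodb.client.model.WriteModel;

public class ParallelNamespaceWriter {

    // Hypothetical: the pool size would come from a new, configurable sink property.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final MongoClient client;

    public ParallelNamespaceWriter(MongoClient client) {
        this.client = client;
    }

    // Submits one bulk write per namespace to the pool and waits for all of them.
    // Ordering within a namespace is unchanged because each namespace still gets
    // exactly one bulkWrite call.
    public void writeAll(Map<MongoNamespace, List<WriteModel<BsonDocument>>> batches) {
        List<CompletableFuture<Void>> futures = batches.entrySet().stream()
                .map(entry -> CompletableFuture.runAsync(() ->
                        client.getDatabase(entry.getKey().getDatabaseName())
                              .getCollection(entry.getKey().getCollectionName(), BsonDocument.class)
                              .bulkWrite(entry.getValue()), pool))
                .collect(Collectors.toList());
        // Block until every namespace's write completes so offsets are only
        // committed after all records have been persisted.
        futures.forEach(CompletableFuture::join);
    }
}
```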
| Comments |
| Comment by Martin Andersson [ 02/May/23 ] |
|
robert.walters@mongodb.com I had a read, and the configuration options mentioned in that article do not address the issue described in this ticket. Within a task, if that task consumes from multiple topics (or the consumed records are mapped to multiple MongoDB namespaces), then the max.batch.size configuration option does not apply: records written to different MongoDB collections are naturally grouped into separate batches, and those batches are still written one at a time.
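As an illustration of the scenario (connector name, topics, and database are made up here), a single-task sink consuming several topics still produces one batch per target collection inside every put() call, regardless of max.batch.size:

```properties
# Hypothetical sink configuration: one task, three topics, default
# topic-to-collection mapping (each topic goes to its own collection).
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
topics=orders,payments,refunds
connection.uri=mongodb://localhost:27017
database=shop
# Caps the size of each bulk write, but cannot merge writes that target
# different namespaces; those remain separate, serially executed batches.
max.batch.size=500
```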
|
| Comment by Robert Walters [ 01/May/23 ] |
|
Hi martin.andersson@kambi.com, please review this blog post: https://www.mongodb.com/developer/products/connectors/tuning-mongodb-kafka-connector/. Under the Sink section there are some recommendations to improve sink write performance, specifically setting the tasks.max property. |
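For reference, the blog's recommendation amounts to raising the connector's task count, e.g. (values here are illustrative only):

```properties
# Hypothetical example of the tasks.max recommendation: more tasks spread the
# consumed topic partitions over more parallel sink tasks.
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=4
topics=orders,payments,refunds
connection.uri=mongodb://localhost:27017
database=shop
```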