We are trying to copy existing data from some very large collections (around 6 million documents each). Our requirement is that we only need a specific subset of the data, not all of it, so in the configuration we provide a pipeline similar to:
"pipeline": "[ { $project: { "updateDescription":0 } }, { $match: {"fullDocument.createdDate":{ "$gt": ISODate("2019-03-31T13:44:54.791Z"), "$lt": ISODate("2020-07-23T13:44:54.791Z")} } } ]".
The MongoDB logs show that the copy query is very expensive. From the connector code, it appears to read the entire collection and only then apply the filter: https://github.com/mongodb/mongo-kafka/blob/master/src/main/java/com/mongodb/kafka/connect/source/MongoCopyDataManager.java#L147. Because the user-supplied pipeline configuration is appended after the connector's default stage, the aggregation scans the whole collection before the $match is applied. Is there an option, or a way, to add the provided pipeline configuration at the beginning of the list?
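To illustrate the ordering problem, here is a simplified paraphrase in Java of what the copy logic at that line appears to do. This is a sketch for illustration only, not the connector's exact code (the real default stage builds a full change-event document with operationType, ns, documentKey, and so on), but the stage ordering is the point:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

public class CopyPipelineSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("mycollection");

            List<Document> pipeline = new ArrayList<>();
            // The connector's default stage comes first: every document is wrapped
            // so that it looks like a change-stream event (fullDocument: "$$ROOT").
            pipeline.add(Document.parse(
                    "{ $replaceRoot: { newRoot: { fullDocument: '$$ROOT' } } }"));
            // The user-supplied "pipeline" config is appended after the wrap, so the
            // $match on fullDocument.createdDate cannot use an index on createdDate:
            // the server scans and wraps all ~6 million documents before filtering.
            pipeline.add(Document.parse(
                    "{ $project: { updateDescription: 0 } }"));
            pipeline.add(Document.parse(
                    "{ $match: { 'fullDocument.createdDate': { "
                    + "$gt: ISODate('2019-03-31T13:44:54.791Z'), "
                    + "$lt: ISODate('2020-07-23T13:44:54.791Z') } } }"));

            coll.aggregate(pipeline).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}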
Also, please let us know about any other configuration options available to make copying existing data more efficient. Thanks.
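Continuing from the sketch above, this is the ordering we are asking for: filter first against the raw documents, then wrap. Note that the field path would become createdDate rather than fullDocument.createdDate, so an index on createdDate could be used. This is hypothetical; as far as we can tell the current configuration does not support it:

List<Document> desired = new ArrayList<>();
// Filter first, on the raw document, so an index on createdDate can apply.
desired.add(Document.parse(
        "{ $match: { createdDate: { "
        + "$gt: ISODate('2019-03-31T13:44:54.791Z'), "
        + "$lt: ISODate('2020-07-23T13:44:54.791Z') } } }"));
// Wrap afterwards, only for the documents that survive the filter.
desired.add(Document.parse(
        "{ $replaceRoot: { newRoot: { fullDocument: '$$ROOT' } } }"));
coll.aggregate(desired);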
- is duplicated by KAFKA-150 "When copy exist, support config pipeline before default replaceRoot pipeline" (Closed)