[KAFKA-117] Is it possible to implement copying new collections if the pipeline has been changed? Created: 18/Jun/20  Updated: 02/Jun/22  Resolved: 10/Aug/20

Status: Closed
Project: Kafka Connector
Component/s: Source
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Andrey B Assignee: Ross Lawley
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Use case workflow:

Create connector with config:

"pipeline": "[{\"$match\": {\"ns.coll\": {\"$regex\": /^(col1|col2)$/}}}]",
"copy.existing": "true",
... 

After some time, update the pipeline in the connector's config to:

"[{\"$match\": {\"ns.coll\": {\"$regex\": /^(col1|col2|col3)$/}}}]"

Desired result after restart:

  • save resume token
  • somehow understand that only documents from the 'col3' collection need to be copied
  • copy documents from 'col3'
  • start streaming from saved resume token

 

What do you think about it?



 Comments   
Comment by Andrey B [ 24/Aug/20 ]

I created a separate ticket for the last question. KAFKA-147

Comment by Andrey B [ 10/Aug/20 ]

I think at the moment reconfiguring the connector requires too much state to be stored.

I agree.

If you wish to add a new collection and copy the existing data over, the process should be something like:

1) Add a new connector to copy and monitor the new collection
2) Once the data copying process has finished and normal change stream events are being published, stop the new connector
3) Reconfigure the existing connector to include the newly added collection.

This approach could lead to data gaps.

 

What do you think about explicitly configuring which collections should be copied? I'm not talking about saving state and checking whether there are new collections to copy, just a config property that defines which collections should be copied at the start of a run.

 

Andrey

Comment by Ross Lawley [ 10/Aug/20 ]

Hi andreworty@gmail.com,

I think at the moment reconfiguring the connector requires too much state to be stored.

If you wish to add a new collection and copy the existing data over, the process should be something like:

1) Add a new connector to copy and monitor the new collection
2) Once the data copying process has finished and normal change stream events are being published, stop the new connector
3) Reconfigure the existing connector to include the newly added collection.

That is probably more efficient than running lots of change stream cursors and connectors, and it still allows the set of watched and copied collections to grow over time.
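
For step 1, the temporary connector could use a pipeline that matches only the newly added collection, for example (a sketch; the connector name, connection URI, and database are placeholders):

{
  "name": "mongo-source-col3-copy",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo:27017",
    "database": "mydb",
    "pipeline": "[{\"$match\": {\"ns.coll\": \"col3\"}}]",
    "copy.existing": "true"
  }
}

Once it has finished copying and is publishing normal change stream events, it can be removed and the existing connector's pipeline regex updated to include col3.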

I'm going to close this ticket for now as "Won't Fix"; however, should more people require this functionality and comment on this ticket, we can always reopen it in the future.

Ross

Comment by Andrey B [ 08/Aug/20 ]

Hi again, Ross,
What do you think about it?

A little bit more about my case:
I have a couple of thousand collections, and streaming will be turned on gradually, not all at once. I also need to copy the existing data for new collections. Usually these collections are quite small, a few hundred or a few thousand documents, so I guess creating a new connector every time I want to start streaming new collections should work for me.

What do you think about copy.existing.collections or copy.existing.collection.regex parameters?
A pipeline could be used to ignore certain collections, but MongoCopyDataManager.copyDataFrom() can still spend time scanning really big collections, and that time would be wasted. In my case up to a few minutes could be wasted on a single collection, and I guess it could be even more in other setups.

 

It would be great to be able to define explicitly which collections should be copied.
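
For example, with the proposed copy.existing.collection.regex property (which does not exist in the connector today), the config could look something like:

"pipeline": "[{\"$match\": {\"ns.coll\": {\"$regex\": \"^(col1|col2|col3)$\"}}}]",
"copy.existing": "true",
"copy.existing.collection.regex": "^col3$"

Streaming would still cover col1, col2 and col3, but the initial copy would only scan col3.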

Comment by Andrey B [ 30/Jun/20 ]

Hi Ross,

thanks for the reply.

Due to the pipeline possibly containing any valid pipeline operation, it would be hard to determine if any new collections existed.

Maybe it would be better to add new parameters to the connector config, like copy.existing.collections or copy.existing.collection.regex.

Also where to keep the metadata about what had already been seen / processed.

I thought about using a dedicated Kafka topic for that.

I think for the level of complexity it would add, registering a new connector instance would potentially be the simplest solution.

I agree it's a much simpler solution.
Could you advise roughly how many connector instances we could create? E.g., would there be a performance impact on MongoDB itself if we created 2000 instances? What about 10000?

Andrey

Comment by Ross Lawley [ 30/Jun/20 ]

Hi andreworty@gmail.com,

Thanks for the ticket. Due to the pipeline possibly containing any valid pipeline operation, it would be hard to determine if any new collections existed. Also where to keep the metadata about what had already been seen / processed.

I think for the level of complexity it would add, registering a new connector instance would potentially be the simplest solution.

Ross
