[KAFKA-247] Recreate change stream from the point of failure for event > 16 MB Created: 12/Aug/21  Updated: 01/Sep/23  Resolved: 25/Jul/23

Status: Closed
Project: Kafka Connector
Component/s: Source
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Dhruvang Makadia Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: external-user
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Design
design is described in KAFKA-381 Support change stream split large events Backlog
Related
related to SERVER-55062 Change stream events can exceed 16MB ... Closed
related to KAFKA-381 Support change stream split large events Backlog

 Description   

When a change event in the change stream exceeds the 16MB limit, the existing change stream is closed with an exception and a new change stream is opened. In a system with a high update load, change events that occur in the time it takes to start the new change stream will likely be missed. I have two proposals for improvement.

Solution #1

The error message of the exception contains the resumeToken of the failed event. Use "ChangeStream.startAfter(<resumeToken>)" to start the new stream just after the failed event, leading to zero loss of events.

Example error message

BSONObj size: 19001449 (0x121F069) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: { _data: "826115AEE9000000012B022C0100296E5A1004D148D6B22E8F49B3A65DAE80A4683566463C5F6964003C316663726B36326F6D30303030303030000004" }
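For illustration only, a minimal sketch of what Solution #1 could look like with the MongoDB Java driver. It assumes the failure surfaces as a MongoCommandException and that the resume token can be pulled out of the error message with a regex; the class name and pattern are made up for this example and are not connector code.

import com.mongodb.MongoCommandException;
import com.mongodb.client.ChangeStreamIterable;
import com.mongodb.client.MongoCollection;
import org.bson.BsonDocument;
import org.bson.BsonString;
import org.bson.Document;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ResumeAfterTooLargeEvent {

    // The failed event's resume token appears in the error message as: _id: { _data: "..." }
    private static final Pattern RESUME_TOKEN = Pattern.compile("_data: \"([0-9A-F]+)\"");

    static ChangeStreamIterable<Document> resumePastFailedEvent(
            MongoCollection<Document> collection, MongoCommandException error) {
        Matcher matcher = RESUME_TOKEN.matcher(error.getErrorMessage());
        if (!matcher.find()) {
            throw error; // not the "BSONObj size ... is invalid" case, so rethrow
        }
        // Rebuild the resume token document and start just *after* the failed event.
        BsonDocument resumeToken = new BsonDocument("_data", new BsonString(matcher.group(1)));
        return collection.watch().startAfter(resumeToken);
    }
}

Parsing error messages is brittle, so something like this would only be a stop-gap until a structured resume point is exposed.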

Solution #2

Increment the "clusterTime" (introduced in v4.0) available in the MongoCommandException by 1 ordinal, and use it with "ChangeStream.startAtOperationTime(<BsonTimestamp>)".

For a sharded cluster, multiple events may share the same cluster time, so this approach can skip good events that have the same timestamp as the bad one.
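Again for illustration only, a minimal sketch of Solution #2, assuming the command error response carries the usual $clusterTime document; the class name is made up for this example and is not connector code.

import com.mongodb.MongoCommandException;
import com.mongodb.client.ChangeStreamIterable;
import com.mongodb.client.MongoCollection;
import org.bson.BsonTimestamp;
import org.bson.Document;

public final class ResumeAtNextClusterTime {

    static ChangeStreamIterable<Document> resumeAfterFailedEvent(
            MongoCollection<Document> collection, MongoCommandException error) {
        // Command responses normally include $clusterTime: { clusterTime: <Timestamp>, ... }
        BsonTimestamp clusterTime = error.getResponse()
                .getDocument("$clusterTime")
                .getTimestamp("clusterTime");
        // Bump the ordinal by one so the new stream starts past the failed event.
        BsonTimestamp next = new BsonTimestamp(clusterTime.getTime(), clusterTime.getInc() + 1);
        return collection.watch().startAtOperationTime(next);
    }
}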



 Comments   
Comment by Robert Walters [ 25/Jul/23 ]

Handling of large messages will be implemented with KAFKA-381

Comment by Ross Lawley [ 25/Jul/23 ]

I think this ticket should be closed as "Won't fix" as we cannot resume a change stream from the point of failure.

Recommend directing users to KAFKA-381

Comment by Ross Lawley [ 17/Aug/21 ]

Hi dhruvangmakadia1@gmail.com,

The last seen resume token is stored as the offset, so resiliency is there for other events: the connector will continue after the last seen event. It's just that this exception is non-resumable, because the last consumed event occurs before the too-large event, and a change stream restarted at the last seen (processed) event would continue to hit the same error.

So the challenge is to capture the message-too-large error and process it differently from other errors (essentially skip that event). However, that will depend on the user's configuration, as skipping that event results in data loss. The only way to ensure no data loss would be to restart and go through the copy data process.

Ross
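For readers unfamiliar with the offset mechanism described above, a simplified sketch (not the connector's actual code; the partition/offset keys and names are invented for this example) of a source task storing the last seen resume token as its Kafka Connect offset:

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import java.util.Collections;

final class OffsetSketch {
    // The resume token of each published event travels with the record as the
    // source offset, so on restart Kafka Connect hands back the last *seen*
    // token, which still precedes the too-large event.
    static SourceRecord toRecord(String resumeTokenJson, String changeEventJson) {
        return new SourceRecord(
                Collections.singletonMap("ns", "mydb.mycollection"),  // source partition
                Collections.singletonMap("_id", resumeTokenJson),     // source offset = resume token
                "mydb.mycollection",                                  // topic
                Schema.STRING_SCHEMA, changeEventJson);
    }
}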

Comment by Dhruvang Makadia [ 16/Aug/21 ]

Hi Ross Lawley,

Although I did the investigation and filed the ticket just for the large-event exception, I wonder if a similar improvement can be made for other exceptions that close the change stream as well. In an ideal world, we would like to have no data loss between MongoDB updates and Kafka.

Comment by Ross Lawley [ 16/Aug/21 ]

Hi dhruvangmakadia1@gmail.com,

Thanks for the ticket. This is something we can look into improving.

Unfortunately, until SERVER-55062 is implemented, a large change stream document will result in some data loss.

All the best,

Ross
