[KAFKA-93] Connector issue with high load: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to failure to restore tailable cursor position"'
Created: 30/Mar/20  Updated: 10/Jun/20  Resolved: 07/Apr/20

Status: Closed
Project: Kafka Connector
Component/s: None
Affects Version/s: 1.1
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Sabari Gandhi Assignee: Ross Lawley
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB: 3.6.17, Kafka Connector 1.1


Attachments: File connector-config.json    
Issue Links:
Duplicate
duplicates KAFKA-76 Reuse the postBatchResumeToken Closed
Case:

 Description   

Hi, I am using MongoDB 3.6.17 in my setup. Since MongoDB versions before 4.0 do not support database-level change streams, I am registering multiple connector instances: 10 connector instances, each listening to one of 10 collections. I then try to load around 5000 documents into the collections. After some time I start seeing failures like the ones below; the connector fails and the related Kafka topic ends up with only partial messages.
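For reference, each per-collection instance is registered through the Kafka Connect REST API roughly as follows (database, collection, and topic names here are illustrative only; the actual settings are in the attached connector-config.json):

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mongo-source-2",
    "config": {
      "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
      "connection.uri": "mongodb://mongo1:27017",
      "database": "mydb",
      "collection": "collection2",
      "topic.prefix": "mongo"
    }
  }'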

Also, when I check the connector status with curl -X GET http://localhost:8083/connectors/mongo-source-2/status, the result says it is RUNNING, but the connector is actually in a failed state.
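For context, the /status endpoint reports the connector state and each task's state separately, and the connector can show RUNNING while an individual task has FAILED. A quick way to check the task states and restart a failed task (illustrative; assumes jq is installed and task 0 is the failed one):

curl -s http://localhost:8083/connectors/mongo-source-2/status | jq '.tasks[] | {id, state}'
curl -X POST http://localhost:8083/connectors/mongo-source-2/tasks/0/restart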

 

[2020-03-30 17:00:15,513] WARN [Producer clientId=confluent.monitoring.interceptor.connector-producer-mongo-source-7-0] Error while fetching metadata with correlation id 3 : {_confluent-monitoring=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6810047002907247540)"' on server mongo1:27017
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6810046998612279328)"' on server mongo1:27017
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6810047020087116556)"' on server mongo1:27017
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6810046998612281250)"' on server mongo1:27017
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6810047015792149212)"' on server mongo1:27017
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6810047020087116658)"' on server mongo1:27017



 Comments   
Comment by Sabari Gandhi [ 09/Apr/20 ]

Hi Ross, thanks for the update. I have added myself to the watch list of KAFKA-76. Also, when you get a chance, could you provide some information about the release cycle planned for the connectors? Thanks in advance.

Comment by Ross Lawley [ 07/Apr/20 ]

Hi sabari.mgn@gmail.com,

Thanks for the ticket. The issue here is that the connector has stored an offset but cannot restart the change stream at that resume token. Currently, the only workaround is to clear the offsets stored in Kafka (one possible approach is sketched below).

Please watch the KAFKA-76 ticket, which will use the high-water mark to help ensure this doesn't happen.

Ross
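
One way to clear the stored source offsets is sketched below. It assumes the default worker offsets topic name connect-offsets and that kcat is available; both the topic name (offset.storage.topic in the worker config) and the exact offset key must be verified for your setup.

# 1. Find the offset entry for the failing connector (printed as KEY<TAB>VALUE):
kcat -b localhost:9092 -t connect-offsets -C -K $'\t' -e | grep mongo-source-2
# 2. Produce a tombstone (NULL value) for that exact key so the task starts without a resume token.
#    Paste the key printed in step 1 before the tab:
echo $'["mongo-source-2",{...}]\t' | kcat -b localhost:9092 -t connect-offsets -P -K $'\t' -Z
# 3. Restart the connector:
curl -X POST http://localhost:8083/connectors/mongo-source-2/restart

Alternatively, deleting the connector and re-registering it under a new name gives it a fresh offset key, with the same effect.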

Comment by Sabari Gandhi [ 30/Mar/20 ]

I have attached the configuration I use for one of my connector instances. Please read my description as: I try to load around 50000 documents into the collections (I am not able to edit the description to fix the typo).

Because of the error above, the connector is not able to recover, and I see the messages below:

[2020-03-30 20:27:44,880] INFO Failed to resume change stream: resume of change stream was not possible, as the resume token was not found. {_data: BinData(0, "825E8254BF000008E9461E5F6964002C34B0005A10046533550540C240D080F0DE7DA09C792704"), _typeBits: BinData(0, "01")} 280 (com.mongodb.kafka.connect.source.MongoSourceTask)

 

But the curl command for the connector status still returns RUNNING. I am able to reproduce the scenario with 2 connector instances under high load. Please let me know if you need any further information. Thanks in advance.
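For reference, a rough load sketch along these lines reproduces it (database and collection names are hypothetical; roughly 5000 documents go into each of the 10 collections in parallel):

for c in $(seq 0 9); do
  mongo "mongodb://mongo1:27017/testdb" --quiet --eval '
    var docs = [];
    for (var i = 0; i < 5000; i++) { docs.push({ seq: i, payload: "load-test" }); }
    db.getCollection("coll'$c'").insertMany(docs);' &
done
wait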
