[SERVER-53858] Replica sets are going into recovery mode while running calls at 20k TPS Created: 18/Jan/21  Updated: 24/Feb/21  Resolved: 11/Feb/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Azhar Yousuf Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:
  1. Run calls at 20k TPS
  2. Some replica sets will go into recovery mode
Participants:

 Description   

Our setup is capable of running calls at 50-60k TPS. During such runs, some of our replica sets went into recovery mode and could not sync with the primary. I know that I can recover the system by stopping mongo, deleting the data files, and starting it again so that it resyncs with the primary, as per the guide https://docs.mongodb.com/manual/tutorial/resync-replica-set-member/

We saw the below error message in our logs:

Resync is needed for secondary member SESSION-SET12:set01j2:vm01:27737 this member is lagging behind by 14747 seconds from the primary
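
The message above reports a lag of 14747 seconds, which is roughly 4.1 hours. A common reason for a secondary to need a full resync is that its lag has grown beyond the primary's oplog window; the ticket does not establish whether that happened here. The following is a minimal sketch of that comparison, assuming a status document shaped like a replSetGetStatus reply (a "members" list with "name", "stateStr" and "optimeDate"). The primary's host name and the 2-hour oplog window are made-up values for illustration only.

```python
from datetime import datetime, timedelta, timezone


def secondary_lag_seconds(status):
    """Return {member name: lag in seconds} for every non-primary member.

    `status` is assumed to look like a replSetGetStatus reply: a "members"
    list whose entries carry "name", "stateStr" and "optimeDate".
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    lags = {}
    for member in members:
        if member is primary:
            continue
        delta = primary["optimeDate"] - member["optimeDate"]
        lags[member["name"]] = delta.total_seconds()
    return lags


def needs_resync(lag_seconds, oplog_window_seconds):
    """True when the lag exceeds the oplog window, so oplog catch-up is impossible."""
    return lag_seconds > oplog_window_seconds


# Illustrative data only: "vm00" and the 2-hour oplog window are hypothetical.
now = datetime(2021, 1, 13, 8, 0, 0, tzinfo=timezone.utc)
sample_status = {
    "members": [
        {"name": "vm00:27737", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "vm01:27737", "stateStr": "RECOVERING",
         "optimeDate": now - timedelta(seconds=14747)},
    ]
}
lags = secondary_lag_seconds(sample_status)
for name, lag in lags.items():
    print(name, lag, needs_resync(lag, oplog_window_seconds=2 * 3600))
```

Tracking this number over time on each member, together with the actual oplog window of the primary, would show whether the secondaries were falling behind gradually under load or dropped off suddenly.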

Recovering the system is not an issue, but we would like to know why some of our replica sets are going into recovery mode in the first place. We request your kind assistance.

Mongo version
[root@vm01 ~]# rpm -qa | grep mongo
mongodb-org-3.6.17-1.el8.x86_64
mongodb-org-mongos-3.6.17-1.el8.x86_64
mongodb-org-tools-3.6.17-1.el8.x86_64
mongodb-org-server-3.6.17-1.el8.x86_64
mongodb-org-shell-3.6.17-1.el8.x86_64

Thanks and Regards,

Azhar



 Comments   
Comment by Azhar Yousuf [ 24/Feb/21 ]

Okay, we'll try to reproduce the issue and will collect all the logs.

Thanks and Regards,
Azhar

Comment by Dmitry Agranat [ 11/Feb/21 ]

Hi rizwiazhar@gmail.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Dmitry Agranat [ 28/Jan/21 ]

Hi rizwiazhar@gmail.com, the provided data does not cover the time of the event you've mentioned. What we have now is partial data from a week after the event occurred. This makes it difficult to understand what caused some members to go into recovery mode.

If you can reproduce this issue, I recommend collecting all the data mentioned here, for all members of the cluster and uploading a fresh set of data.

Thanks,
Dima

Comment by Azhar Yousuf [ 21/Jan/21 ]

Hi Dima,
    I only attached the logs of the members which got stuck in recovery mode. The issue occurred on 2021-Jan-13, somewhere around 6:00 to 8:00. However, I could see some errors on heartbeat requests at 10:00, which are shown below

Also, these are the names of the servers which got stuck in recovery mode:

Member-4 - 27717 :  - RECOVERING - vm15- ON-LINE - 16 hr - 3
Member-3 - 27737 :  - RECOVERING - vm15- ON-LINE - 1 days - 4
Member-3 - 27717 :  - RECOVERING - vm20- ON-LINE - 1 days - 2
Member-1 - 27717 :  - RECOVERING - vm22- ON-LINE - 1 days - 4
Member-2 - 27757 :  - RECOVERING - vm20- ON-LINE - 9 hr - 2

 

Thanks and Regards,

Azhar

Comment by Dmitry Agranat [ 20/Jan/21 ]

Hi rizwiazhar@gmail.com, I did not see any issue with the data you have uploaded. Could you please clarify:

  • Where did the issue occur? What is the name of the server that went into recovery? What was the name of the Primary at that time?
  • When did the issue occur? Please provide timestamps and the timezone for the start and end of the event
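
One way to make an event window unambiguous is to state it in UTC. The sketch below converts a local start and end time into ISO 8601 UTC strings. It uses the 6:00 to 8:00 window on 2021-Jan-13 reported elsewhere in this ticket; the UTC+05:30 offset is purely an assumption for illustration, since the ticket does not state the timezone, and `event_window_utc` is a hypothetical helper name.

```python
from datetime import date, datetime, timedelta, timezone


def event_window_utc(day, start_hm, end_hm, utc_offset_minutes):
    """Convert a local "H:MM" start/end on `day` to ISO 8601 strings in UTC."""
    local = timezone(timedelta(minutes=utc_offset_minutes))

    def at(hm):
        hour, minute = (int(part) for part in hm.split(":"))
        return datetime(day.year, day.month, day.day, hour, minute, tzinfo=local)

    start, end = at(start_hm), at(end_hm)
    return (start.astimezone(timezone.utc).isoformat(),
            end.astimezone(timezone.utc).isoformat())


# The offset of 330 minutes (UTC+05:30) is an assumed value, not from the ticket.
window = event_window_utc(date(2021, 1, 13), "6:00", "8:00", utc_offset_minutes=330)
print(window)
```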

Please note that the diagnostic.data for nd5bwa5psm22va is missing.

Thanks,
Dima

Comment by Azhar Yousuf [ 20/Jan/21 ]

Hi Dima, 
   Our environment is quite big and we have a total of 51 sets. So for now I have uploaded the diagnostic data only from the affected sets whose logs I uploaded previously, which were stuck in recovery mode. Kindly let me know if these logs are sufficient.

Thanks and Regards,

Azhar

Comment by Azhar Yousuf [ 20/Jan/21 ]

Hi Dima,
   I have uploaded the mongodb logs from the replica sets which were stuck in recovery state. I will collect the diagnostics data and all the replica set logs in some more time, because our setup is busy at the moment.

Thanks and Regards,

Azhar

Comment by Dmitry Agranat [ 19/Jan/21 ]

Hi rizwiazhar@gmail.com,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from all members of the replica set and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
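
The archiving step requested above could be scripted per member along these lines. This is a minimal sketch using Python's standard tarfile module; `archive_diagnostics` is a hypothetical helper, and the paths in the usage comment are assumptions that must be replaced with each member's actual log path and dbpath.

```python
import tarfile
from pathlib import Path


def archive_diagnostics(log_file, dbpath, out_file):
    """Bundle one member's mongod.log and $dbpath/diagnostic.data into a .tar.gz."""
    log_file, dbpath, out_file = Path(log_file), Path(dbpath), Path(out_file)
    diag_dir = dbpath / "diagnostic.data"
    with tarfile.open(out_file, "w:gz") as tar:
        tar.add(log_file, arcname=log_file.name)
        # tarfile adds directories recursively by default.
        tar.add(diag_dir, arcname="diagnostic.data")
    return out_file


# Example call (paths are assumptions, adjust per member):
# archive_diagnostics("/var/log/mongodb/mongod.log", "/var/lib/mongo",
#                     "vm01-27737-diagnostics.tar.gz")
```

Naming each archive after the host and port of the member it came from keeps the uploads distinguishable when many replica sets are involved.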

Thanks,
Dima
