[SERVER-53738] Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries Created: 13/Jan/21  Updated: 11/Feb/21  Resolved: 11/Feb/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sourabh Ghosh Assignee: Dmitry Agranat
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-01-20 at 17.18.41.png     PNG File Screen Shot 2021-01-20 at 17.19.14.png    
Operating System: ALL
Participants:

 Description   

Hello Team,
Recently we faced an issue on our production system on 8th Jan 2021, between 16:20-16:40 IST, where one of our mongod (Primary) instances stopped responding. The version we are using is 4.2.8.
We were able to log in to the primary instance, but it was getting stuck on show dbs and any other command. There was no re-index/index build in progress that could explain the locking. We checked the logs and found the following messages:
Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries
Unable to gather storage statistics for a slow operation due to lock aquire timeout
It remained in that state for almost 10-15 minutes, after which everything went back to healthy.
Could you kindly shed some light on what could be causing this issue, and let us know if anything is required from our side?
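
For reference, the flow control state behind this warning is exposed in serverStatus on the primary; a minimal sketch of how it could be checked with pymongo (the connection string is an illustrative assumption):

    # Minimal sketch: inspect flow control state on the primary (MongoDB 4.2+).
    # The connection string is an assumption; point it at the affected primary.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    fc = client.admin.command("serverStatus").get("flowControl", {})

    print("enabled:        ", fc.get("enabled"))          # flow control feature on/off
    print("isLagged:       ", fc.get("isLagged"))         # True while the set is considered lagged
    print("targetRateLimit:", fc.get("targetRateLimit"))  # write tickets/sec the primary hands out
    print("sustainerRate:  ", fc.get("sustainerRate"))    # rate at which the sustainer point advances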



 Comments   
Comment by Dmitry Agranat [ 11/Feb/21 ]

Hi sourabh.ghosh@visiblealpha.com, since it's been over a month since the original issue was reported, I will go ahead and close this case. If this re-occurs, please collect the requested information and open a new ticket and we'll be happy to investigate.

Regards,
Dima

Comment by Sourabh Ghosh [ 27/Jan/21 ]

Thanks, Dmitry, for your time and the information.

Comment by Dmitry Agranat [ 26/Jan/21 ]

sourabh.ghosh@visiblealpha.com, the affected Primary was still responding from a networking perspective. The condition for an election, "the secondary members losing connectivity to the primary for more than the configured timeout (10 seconds by default)", was not met.

There is no need to enable more verbose logging at this time.

Comment by Sourabh Ghosh [ 22/Jan/21 ]

Yes Dmitry,
This occurred for the very first time. The Primary was not responding, but the question remains: if it was not responding, one of the secondaries should have been promoted to primary (an election should have taken place in such a scenario).

And yes, we will collect the currentOp and lockInfo details going forward if this occurs again. One more thing: should we increase the log verbosity level as a whole, or only for particular components, to capture the logs in more detail? Are there any drawbacks, given that we have a massive write- and read-intensive workload?

Comment by Dmitry Agranat [ 21/Jan/21 ]

Hi sourabh.ghosh@visiblealpha.com, after analyzing the provided data, the event that you've reported might have happened due to some operation holding an intent lock for an extended period without yielding (SERVER-53180 would have been useful in this situation). It is not obvious what that operation was during this event as several significant operations were executed at the same time.

The blocking event appears to have started at the same time as createIndexes, catalogCache refresh, and recvChunk operations. This might be due to operations on one collection being blocked waiting for an incremental refresh that in turn is not progressing, or is progressing slowly. However, with the information available, we cannot determine this at this point.

It will be helpful to collect more data the next time this happens. Specifically, collecting currentOp and lockInfo during such an event would tell us which databases and collections the operations are trying to take locks on, and which locks they already hold. If this is a recurring event, you could put in place a monitor that collects currentOp and lockInfo periodically, say every 10 or 30 seconds. You can archive, rotate and delete previously collected data if it takes days to reproduce; all we need to keep is the data covering the reported event.
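
A minimal sketch of such a monitor, assuming pymongo and a 30-second interval (connection string and output directory are illustrative assumptions):

    # Minimal sketch: periodically dump currentOp and lockInfo to timestamped JSON files.
    # Connection string, interval and output directory are assumptions; adjust as needed.
    import datetime
    import pathlib
    import time

    from bson import json_util
    from pymongo import MongoClient

    INTERVAL_SEC = 30
    OUT_DIR = pathlib.Path("/var/tmp/mongo-lock-monitor")   # hypothetical output location
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    client = MongoClient("mongodb://localhost:27017")

    while True:
        ts = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        # Include idle connections/sessions so blocked operations are visible.
        current_op = client.admin.command({"currentOp": 1, "$all": True})
        lock_info = client.admin.command({"lockInfo": 1})
        for name, doc in (("currentOp", current_op), ("lockInfo", lock_info)):
            (OUT_DIR / f"{name}_{ts}.json").write_text(json_util.dumps(doc))
        time.sleep(INTERVAL_SEC)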

Was this a one-time event or does this occur periodically?

Comment by Sourabh Ghosh [ 20/Jan/21 ]

Thanks, Dmitry, for the update.

Comment by Dmitry Agranat [ 19/Jan/21 ]

Thanks sourabh.ghosh@visiblealpha.com, I am still investigating this and will have an update for you later this week

Comment by Sourabh Ghosh [ 18/Jan/21 ]

Hi Dmitry,
We are using this binary: percona-server-mongodb-4.2.8-8-r389dde5-bionic-x86_64-bundle.tar.

Comment by Dmitry Agranat [ 18/Jan/21 ]

Hi sourabh.ghosh@visiblealpha.com, thank you for providing the requested information.

Could you please clarify what the 4.2.8-8 binary stands for?

Comment by Sourabh Ghosh [ 14/Jan/21 ]

Thanks, Dmitry,

I have uploaded all the mongod and diagnostic data logs to the link you mentioned above.
Below are the file names; kindly ignore any other files and folders you may find there.
Primary mongod files: diagnostic.data_pri.gz / mongod.log_pri.gz
Secondary mongod files: diagnostic.data_sec.gz / mongod.log_sec.gz
Secondary2 mongod files: diagnostic.data_sec2.gz / mongod.log_sec2.gz
Let us know if you require any more details from our end.

Comment by Dmitry Agranat [ 13/Jan/21 ]

Hi sourabh.ghosh@visiblealpha.com,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from all members of the replica set and upload them to this support uploader location?
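
For example, the archives could be created along these lines (both paths are assumptions; substitute your actual log file and $dbpath locations):

    # Minimal sketch: archive mongod.log and the diagnostic.data directory for upload.
    # Both paths are assumptions; substitute the actual log file and $dbpath locations.
    import tarfile

    ARCHIVES = {
        "mongod_log.tar.gz": "/var/log/mongodb/mongod.log",    # hypothetical log path
        "diagnostic_data.tar.gz": "/data/db/diagnostic.data",  # hypothetical $dbpath
    }

    for archive_name, path in ARCHIVES.items():
        with tarfile.open(archive_name, "w:gz") as tar:
            tar.add(path)   # directories are added recursively
        print("created", archive_name)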

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima
