[SERVER-38720] 3-node replica set: periodically load avg above 100 on primary, unable to answer queries Created: 20/Dec/18  Updated: 20/Dec/18  Resolved: 20/Dec/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Frank Steinborn Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File mongo1.png     PNG File mongo2.png     PNG File mongo3.png    
Participants:

 Description   

Hi,

 

We're running a 3-node replica set with MongoDB version 4.0.1. Until recently we ran the same data set on a version 2.4 replica set, where we saw the same issue.

 

Once in a while, load on the primary node suddenly spikes and active reads pile up (see the attached screenshots from our Grafana dashboard). When this happens, the cluster is unable to answer queries at all. Our short-term workaround is to either run rs.stepDown() or restart mongod on the primary.

 

We would like input on how to proceed with debugging this. We have not yet been able to spot a suspect query that might cause it. The replica set ran fine for years before the issue first appeared a few months ago, and we are unsure what is causing it.

 

Attached are MongoDB metrics and host metrics where the problem can be seen.

 

Thanks!



 Comments   
Comment by Danny Hatcher (Inactive) [ 20/Dec/18 ]

Hello Frank,

Thanks for your report. Please note that SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. See also our Technical Support page for additional support resources.

I recommend ensuring that your server configuration matches our Production Notes, as there may be some easy changes to make. Additionally, you can look through your logs for queries with a high nscanned / nreturned ratio (logged as keysExamined and docsExamined versus nreturned in MongoDB 3.0 and later), as those queries would likely benefit from index optimization. You may also wish to try increasing the amount of RAM available on the machines in question, as high cache use is a frequent cause of performance issues.
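
As a rough illustration of the log check, here is a minimal Python sketch that scans mongod log lines and ranks operations by how many documents they examined per document returned. It assumes the plain-text slow-operation log format used by MongoDB 4.0, where each logged operation carries docsExamined:N, nreturned:N, and a trailing duration in milliseconds. The function name, the threshold of 100, and the sample log lines are made up for illustration and do not come from the reporter's system.

```python
import re
from typing import Iterable, List, NamedTuple

# Fields as they appear in MongoDB 4.0 plain-text slow-operation log lines.
DOCS_RE = re.compile(r"\bdocsExamined:(\d+)")
NRET_RE = re.compile(r"\bnreturned:(\d+)")
MILLIS_RE = re.compile(r"\b(\d+)ms\s*$")


class SlowOp(NamedTuple):
    ratio: float        # documents examined per document returned
    docs_examined: int
    nreturned: int
    millis: int         # -1 if no trailing duration was found
    line: str


def find_inefficient_ops(lines: Iterable[str], min_ratio: float = 100.0) -> List[SlowOp]:
    """Return logged operations whose examined/returned ratio is at least
    min_ratio, worst first. min_ratio is an arbitrary illustrative cutoff."""
    ops = []
    for raw in lines:
        line = raw.rstrip("\n")
        docs = DOCS_RE.search(line)
        nret = NRET_RE.search(line)
        if not (docs and nret):
            continue  # not an operation line with scan statistics
        examined = int(docs.group(1))
        returned = int(nret.group(1))
        # Treat zero returned documents as one to avoid division by zero.
        ratio = examined / max(returned, 1)
        if ratio < min_ratio:
            continue
        ms = MILLIS_RE.search(line)
        ops.append(SlowOp(ratio, examined, returned, int(ms.group(1)) if ms else -1, line))
    ops.sort(key=lambda op: op.ratio, reverse=True)
    return ops


# Hypothetical log lines, written by hand to resemble the 4.0 format.
SAMPLE_LOG = [
    '2018-12-20T10:15:30.123+0000 I COMMAND  [conn12] command app.users command: find '
    '{ find: "users", filter: { email: "a@example.com" } } planSummary: COLLSCAN '
    'keysExamined:0 docsExamined:500000 numYields:3906 nreturned:1 reslen:245 protocol:op_msg 1234ms',
    '2018-12-20T10:15:31.456+0000 I COMMAND  [conn13] command app.users command: find '
    '{ find: "users", filter: { _id: 7 } } planSummary: IXSCAN { _id: 1 } '
    'keysExamined:1 docsExamined:1 nreturned:1 reslen:180 protocol:op_msg 101ms',
    '2018-12-20T10:15:32.000+0000 I NETWORK  [listener] connection accepted from 10.0.0.5:51234 #13',
]

if __name__ == "__main__":
    for op in find_inefficient_ops(SAMPLE_LOG):
        print(f"ratio={op.ratio:.0f} docsExamined={op.docs_examined} "
              f"nreturned={op.nreturned} millis={op.millis}")
```

To run this against a real log, pass an open file handle instead of SAMPLE_LOG, for example find_inefficient_ops(open(path)), where path points at the mongod log. Operations that examine many documents to return few, especially those with planSummary: COLLSCAN, are the usual candidates for a new index.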

Thank you,

Danny
