[SERVER-38720] 3-node replica set: periodically load avg above 100 on primary, unable to answer queries Created: 20/Dec/18 Updated: 20/Dec/18 Resolved: 20/Dec/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Frank Steinborn | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Participants: |
| Description |
|
Hi,
we're running a 3-node replica set with MongoDB version 4.0.1. Until recently we ran the same data set on a replica set with version 2.4, where we saw the same issue.
Once in a while, load on the primary node suddenly spikes and active reads pile up (see the attached screenshot from our Grafana dashboard). When this happens, the cluster cannot answer queries at all. The short-term workaround is to either run rs.stepDown() or restart the mongod on the primary.
We would like input on how to proceed with debugging this. We have not yet found a query that looks like a likely cause. The replica set ran fine for years before the issue first appeared a few months ago, and we are unsure what is causing it.
Attached are MongoDB metrics and host metrics where the problem can be seen.
Thanks! |
| Comments |
| Comment by Danny Hatcher (Inactive) [ 20/Dec/18 ] |
|
Hello Frank,
Thanks for your report. Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag. See also our Technical Support page for additional support resources.
I recommend ensuring that your server configuration matches our Production Notes, as there may be some easy changes to make. Additionally, you can look through your logs for queries with a high nscanned / nreturned ratio, as those queries could likely benefit from index optimization. You may also wish to try increasing the amount of RAM available on the machines in question, as high cache use is a frequent cause of performance problems.
Thank you,
Danny |
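The log check suggested above could be scripted roughly as follows. This is only a sketch, not an official tool: it assumes the slow-operation lines in the mongod log carry plain `docsExamined:N` (or, on much older releases, `nscanned:N`) and `nreturned:N` fields, and the function names and the threshold of 100 are made up for illustration.

```python
import re

# Assumed field names in mongod slow-operation log lines: recent releases
# log docsExamined, much older ones logged nscanned. Adjust if yours differ.
_EXAMINED = re.compile(r"\b(?:docsExamined|nscanned):(\d+)")
_RETURNED = re.compile(r"\bnreturned:(\d+)")


def scan_ratio(line):
    """Return (examined, returned, ratio) for one log line, or None."""
    examined = _EXAMINED.search(line)
    returned = _RETURNED.search(line)
    if not examined or not returned:
        return None
    n_examined = int(examined.group(1))
    n_returned = int(returned.group(1))
    # Count zero returned documents as one to avoid division by zero.
    ratio = n_examined / max(n_returned, 1)
    return n_examined, n_returned, ratio


def suspicious_lines(lines, min_ratio=100.0):
    """Return (ratio, line) pairs at or above min_ratio, worst first."""
    hits = []
    for line in lines:
        parsed = scan_ratio(line)
        if parsed and parsed[2] >= min_ratio:
            hits.append((parsed[2], line.rstrip("\n")))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits
```

To use it, pass any iterable of log lines, for example an open file handle on the primary's mongod log (the path depends on your installation). Lines near the top of the result are the first candidates for an index review.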