[SERVER-32465] Abnormally high (100%) CPU on secondary instance Created: 27/Dec/17 Updated: 27/Oct/23 Resolved: 16/Feb/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.2.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Johnny Shields | Assignee: | Kelsey Schubert |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We experienced the following on our production replica set: 3 secondaries 1 primary, MongoDB 3.2.12 + WiredTiger. 1. All secondaries saw a spike in disk usage (about 4-5x base load, we see something like this about twice per day randomly) Here is a CPU chart of the 4 instances. As you can see CPU of one of the instances hit 100%. (As it was not responding to queries, our app failed and stopped sending new queries, so other instances went to low CPU) Here is a chart showing the disk usage spike on the affected instance which happened before the high CPU. Unfortunately we did not get a core dump of the affected process. |
| Comments |
| Comment by Kelsey Schubert [ 16/Feb/18 ] | |||
|
Thanks for the update, johnnyshields. Given the infrequency of this issue, I'm going to close this ticket. If you encounter it again on 3.4, please let us know and we can continue to investigate. | |||
| Comment by Johnny Shields [ 05/Feb/18 ] | |||
|
Upgraded to 3.4 and no issue so far, but its very rare (happened once in 2+ years of running MongoDB 3.2) | |||
| Comment by Kelsey Schubert [ 05/Feb/18 ] | |||
|
Hi johnnyshields, Upgrading to a larger host may resolve this issue, however since we haven't completed an RCA, I cannot guarantee that it will reduce the likelihood of a recurrence. Would you please let us know if this is still an issue for you after upgrading to the latest version of MongoDB 3.4? Thanks, | |||
| Comment by Johnny Shields [ 05/Jan/18 ] | |||
|
Hi Kelsey, thank you for the clarification and understood. For avoidance of doubt, upgrading to a host with more CPUs (e.g. 4) would be a workaround that would reduce likelihood of recurrence? | |||
| Comment by Kelsey Schubert [ 05/Jan/18 ] | |||
|
Hi johnnyshields, I'm familiar with Given the relatively small size of this host (2 cpus), it is more susceptible to connection escalations if the application sends queries faster than it can process. However, I believe upgrading is the best way to progress this investigation for two reasons. First, upgrading to 3.4.10, will allow us to quickly determine whether this fix (which we have identified specifically as affecting your workload) resolves this issue. Second, MongoDB 3.4 contains a number of other performance improvements as well additional diagnostics, which would enable us to better understand what is going on if the issue was not resolved. To that end, I included instruction in my previous comment about the type of diagnostic information we would need if issue continued after upgrading. Kind regards, | |||
| Comment by Johnny Shields [ 03/Jan/18 ] | |||
|
Hi Kelsey, one more question. I browsed the code differences and it seems that there's been quite a bit of refactoring in this area between 3.2 and 3.4. It would be helpful if someone who is very familiar with the code and refactoring done can comment if this issue can in fact occur on 3.2? | |||
| Comment by Johnny Shields [ 03/Jan/18 ] | |||
|
Hi Kelsey, is it correct to say that it appears due to the underlying issue of Also, its not entirely clear to me how We will try to reproduce the issue by bombarding the instance with $in queries. | |||
| Comment by Kelsey Schubert [ 02/Jan/18 ] | |||
|
Hi johnnyshields and MiLk, Thanks for providing diagnostic.data and logs. After reviewing the logs, I see a few very large $in queries which occur around the event. Would you please upgrade to the latest version of MongoDB 3.4 to take advantage of If the issue is not resolved after upgrading, please collect perf during the event:
We'll want diagnostic.data, logs, and the perf.txt file. Note that in MongoDB 3.4, diagnostic.data also contains disk and cpu utilization metrics, which also help us better understand what is going on here. Thanks again for you help, | |||
| Comment by Emilien Kenler [X] [ 27/Dec/17 ] | |||
|
Hi Kelsey, I've uploaded the diagnostic data from the 4 servers of the replica set, and the logs for the hour around the incident (around 09:33 UTC). Please let me know if you need additional information. | |||
| Comment by Kelsey Schubert [ 27/Dec/17 ] | |||
|
Hi johnnyshields, Thanks for reporting this issue. So we can continue to investigate, would you please upload the following information?
Thank you, |