[SERVER-32465] Abnormally high (100%) CPU on secondary instance Created: 27/Dec/17  Updated: 27/Oct/23  Resolved: 16/Feb/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Johnny Shields Assignee: Kelsey Schubert
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 2017122709_mongod.log.gz     PNG File Screen Shot 2017-12-27 at 16.25.29.png     PNG File Screen Shot 2017-12-27 at 16.25.56.png     PNG File Screen Shot 2017-12-27 at 16.26.52.png     PNG File Screen Shot 2017-12-27 at 16.27.22.png     File diagnostic_data_04db.tar.gz     File diagnostic_data_05f4.tar.gz     File diagnostic_data_0ddb.tar.gz     File diagnostic_data_0f6a.tar.gz     PNG File image-2017-12-27-19-58-29-795.png     PNG File image-2017-12-27-19-58-33-746.png     File mongoparsed.log    
Operating System: ALL
Participants:

 Description   

We experienced the following on our production replica set (1 primary, 3 secondaries, MongoDB 3.2.12 + WiredTiger):

1. All secondaries saw a spike in disk usage (about 4-5x base load; we see something like this roughly twice per day at random)
2. mongod CPU usage on one of the secondaries shot up to 100%. At this time the replica set continued to route queries to the affected instance, but the instance did not process them.
3. We killed all queries on the affected instance via the console (see the sketch after this list); the CPU remained at 100% and the instance continued to NOT process queries.
4. Finally, we removed the affected instance from the replica set, after which queries were processed normally again.
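For reference, the query-killing step in item 3 looked roughly like the sketch below. This is a pymongo illustration, not our exact console commands; the host name and the filter on operation type are assumptions.

# Minimal sketch (pymongo): list and kill in-progress client operations on one
# instance. Host name and the op-type filter are assumptions, not our exact script.
from pymongo import MongoClient

# Connect to the affected secondary only (a single-host URI without a
# replicaSet option targets just that node).
client = MongoClient("mongodb://affected-secondary:27017/")

# currentOp reports every in-progress operation on this mongod
current = client.admin.command("currentOp")

for op in current.get("inprog", []):
    # Only target client reads; leave replication/internal ops alone
    if op.get("op") in ("query", "getmore") and "opid" in op:
        client.admin.command("killOp", op=op["opid"])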

Here is a CPU chart of the 4 instances. As you can see, the CPU of one instance hit 100%. (Because it was not responding to queries, our app failed and stopped sending new queries, so the other instances dropped to low CPU.)

Here is a chart showing the disk usage spike on the affected instance, which happened before the high CPU.

Unfortunately we did not get a core dump of the affected process.



 Comments   
Comment by Kelsey Schubert [ 16/Feb/18 ]

Thanks for the update, johnnyshields. Given the infrequency of this issue, I'm going to close this ticket. If you encounter it again on 3.4, please let us know and we can continue to investigate.

Comment by Johnny Shields [ 05/Feb/18 ]

Upgraded to 3.4 and no issue so far, but it's very rare (it happened once in 2+ years of running MongoDB 3.2).

Comment by Kelsey Schubert [ 05/Feb/18 ]

Hi johnnyshields,

Upgrading to a larger host may resolve this issue; however, since we haven't completed an RCA, I cannot guarantee that it will reduce the likelihood of a recurrence. Would you please let us know if this is still an issue for you after upgrading to the latest version of MongoDB 3.4?

Thanks,
Kelsey

Comment by Johnny Shields [ 05/Jan/18 ]

Hi Kelsey, thank you for the clarification; understood. For the avoidance of doubt, would upgrading to a host with more CPUs (e.g. 4) be a workaround that reduces the likelihood of recurrence?

Comment by Kelsey Schubert [ 05/Jan/18 ]

Hi johnnyshields,

I'm familiar with SERVER-30189 and can confirm that it affects MongoDB 3.2. In my previous comment, I suggested that you upgrade to the latest version of MongoDB 3.4 to take advantage of this fix as a number of the queries recorded in your logs have large $in expressions.

Given the relatively small size of this host (2 CPUs), it is more susceptible to connection escalations if the application sends queries faster than it can process them. SERVER-30189 is a known issue that may cause throughput to decrease unexpectedly, resulting in this type of behavior. That said, there is not enough information at this time to determine whether or not upgrading will resolve this issue.

However, I believe upgrading is the best way to progress this investigation, for two reasons. First, upgrading to 3.4.10 will allow us to quickly determine whether this fix (which we have identified specifically as affecting your workload) resolves the issue. Second, MongoDB 3.4 contains a number of other performance improvements as well as additional diagnostics, which would enable us to better understand what is going on if the issue were not resolved. To that end, I included instructions in my previous comment about the type of diagnostic information we would need if the issue continued after upgrading.

Kind regards,
Kelsey

Comment by Johnny Shields [ 03/Jan/18 ]

Hi Kelsey, one more question. I browsed the code differences and it seems there's been quite a bit of refactoring in this area between 3.2 and 3.4. It would be helpful if someone who is very familiar with the code and the refactoring that was done could comment on whether this issue can in fact occur on 3.2.

Comment by Johnny Shields [ 03/Jan/18 ]

Hi Kelsey, is it correct to say that, due to the underlying issue in SERVER-30189, the affected instance appears to have had all of its threads go into a "dead spin-lock" state (similar to a deadlock, but with CPU being consumed)?

Also, it's not entirely clear to me how SERVER-30189 definitively fixes the issue.

We will try to reproduce the issue by bombarding the instance with $in queries.
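A rough sketch of how we plan to do this is below (pymongo; the connection string, collection, field name, and list sizes are placeholders, not our production workload).

# Rough sketch of the planned reproduction: hammer a secondary with queries
# containing very large $in lists. Connection string, collection, field, and
# sizes are placeholders.
import random

from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://rs-member-1:27017/?replicaSet=rs0")
coll = client.testdb.get_collection(
    "items", read_preference=ReadPreference.SECONDARY)

def big_in_query(n=50000):
    # Build a filter with an n-element $in list of random ids
    ids = random.sample(range(10000000), n)
    return {"item_id": {"$in": ids}}

for _ in range(1000):
    # Exhaust the cursor so the server actually executes each query
    list(coll.find(big_in_query()).limit(10))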

Comment by Kelsey Schubert [ 02/Jan/18 ]

Hi johnnyshields and MiLk,

Thanks for providing diagnostic.data and logs. After reviewing the logs, I see a few very large $in queries which occur around the event. Would you please upgrade to the latest version of MongoDB 3.4 to take advantage of SERVER-30189 and let us know if the issue is resolved?

If the issue is not resolved after upgrading, please collect perf during the event:

# record call stack samples and generate text output on test node
perf record -a -g -F 99 sleep 60
perf script > perf.txt

We'll want diagnostic.data, logs, and the perf.txt file. Note that in MongoDB 3.4, diagnostic.data also contains disk and cpu utilization metrics, which also help us better understand what is going on here.

Thanks again for your help,
Kelsey

Comment by Emilien Kenler [X] [ 27/Dec/17 ]

Hi Kelsey,

I've uploaded the diagnostic data from the 4 servers of the replica set, and the logs for the hour around the incident (around 09:33 UTC).
The primary is mongodb001-05f4 and the affected server is mongodb001-04db.

Please let me know if you need additional information.

Comment by Kelsey Schubert [ 27/Dec/17 ]

Hi johnnyshields,

Thanks for reporting this issue. So we can continue to investigate, would you please upload the following information?

  • complete mongod logs from the primary, affected secondary, and an unaffected secondary
  • an archive of the diagnostic.data directory from the primary, affected secondary, and an unaffected secondary

Thank you,
Kelsey
