[SERVER-25700] Very high CPU usage Created: 19/Aug/16  Updated: 27/Sep/16  Resolved: 26/Aug/16

Status: Closed
Project: Core Server
Component/s: Stability, WiredTiger
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Teemu Sirkiä
Assignee: Kelsey Schubert
Resolution: Incomplete
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-26040 High CPU/IOWAIT in MongoDB 3.2.9 Closed
Operating System: ALL
Participants:

 Description   

I updated Mongo from version 3.2.5 to 3.2.9, and since then Mongo has frozen my system twice by using all the CPU time.

There is nothing in Mongo's own logs, but I can see that both times my own process had just performed a number of updates. There was nothing special about those updates, though, as similar updates are done every 10 seconds.

There were about 24 hours between the two incidents, so the problem does not occur very frequently. However, I had to downgrade Mongo back to 3.2.5 to see if this issue goes away.

I searched the open issues but couldn't find anything relevant. As this is my production server and there is not much information about what is wrong, this might be quite hard to debug.



 Comments   
Comment by Kelsey Schubert [ 27/Sep/16 ]

Hi colinhowe,

We've made a significant number of fixes in MongoDB 3.2.10, which we expect to release early next week, and we have heard that users are seeing substantial improvements (see SERVER-26055). Unfortunately, without the diagnostic.data, I cannot be more confident that you are encountering an issue that has been fixed in MongoDB 3.2.10. Nonetheless, I would recommend that you consider upgrading when it is released. If you encounter this issue again after upgrading, please open a new ticket and include the diagnostic.data so we can investigate.

Thank you for your help,
Thomas

Comment by Kelsey Schubert [ 27/Sep/16 ]

To watchers of this ticket, emilburzo opened SERVER-26040, where we determined the issue was related to a change in how frequently journal files are flushed to disk.

Comment by Colin Howe [ 16/Sep/16 ]

Sorry, I only just saw this message, and the dates on the diagnostics indicate that they no longer cover the bad period.

Comment by Kelsey Schubert [ 08/Sep/16 ]

Hi colinhowe and emilburzo,

Thank you for letting us know that you are encountering a similar issue.

To help us diagnose this behavior, would you please each create a new ticket and attach an archive of your diagnostic.data directory?

Thank you,
Thomas

Comment by Emil Burzo [ 08/Sep/16 ]

Another me too post.

I won't copy/paste as I've already written about it here:

https://groups.google.com/forum/#!starred/mongodb-user/42wOHqR_o5Q

On the host with regular HDDs (not SSD) IOWAIT is very significant.

Comment by Colin Howe [ 07/Sep/16 ]

Hi,

We upgraded from 3.2.7 to 3.2.9 and saw similar problems to those described here. We've now downgraded to 3.2.8 and CPU usage is about 25% of what it was. Beyond the CPU usage, on 3.2.9 we were also seeing available tickets drop to 0 and all queries taking ~10x as long as normal.

If you want to see this in MMS - https://cloud.mongodb.com/v2/56375ad1e4b09259595a25b3#host/replicaSet/56375cd6e4b007ecfe71df99

Something isn't right.
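
For readers who want to check the same symptom, here is a minimal sketch of spotting exhausted WiredTiger tickets. It assumes the `wiredTiger.concurrentTransactions` section of a `serverStatus` document, which MongoDB 3.2 reports with `out`, `available` and `totalTickets` counters for reads and writes. The helper names and the sample numbers are illustrative only and are not taken from this ticket.

```python
from typing import Any, Dict, List, Mapping


def available_tickets(server_status: Mapping[str, Any]) -> Dict[str, int]:
    """Return the available read/write tickets from a serverStatus document."""
    tickets = server_status["wiredTiger"]["concurrentTransactions"]
    return {kind: int(tickets[kind]["available"]) for kind in ("read", "write")}


def exhausted_tickets(server_status: Mapping[str, Any]) -> List[str]:
    """List the ticket kinds ("read", "write") that have dropped to 0."""
    return [kind for kind, count in available_tickets(server_status).items() if count == 0]


# Hypothetical sample document; the numbers are made up for illustration.
sample_status = {
    "wiredTiger": {
        "concurrentTransactions": {
            "read": {"out": 128, "available": 0, "totalTickets": 128},
            "write": {"out": 3, "available": 125, "totalTickets": 128},
        }
    }
}

print(exhausted_tickets(sample_status))  # ['read']
```

On a live deployment the document would come from `db.serverStatus()` in the mongo shell, or from `client.admin.command("serverStatus")` with PyMongo.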

Comment by Kelsey Schubert [ 26/Aug/16 ]

Hi ttsirkia,

Unfortunately, after examining the logs and diagnostic.data we have not been able to determine the cause of this behavior. I expect that the issue you have encountered is hindering our ability to collect the necessary diagnostic metrics. If you are able to reproduce this issue please let us know and we will reopen this ticket and work with you to collect additional information to help us debug the problem.

Please be aware that we are actively working to improve the performance of WiredTiger across a variety of workloads and system architectures. It is likely that a future version of MongoDB will resolve this issue. Unfortunately, at this time, I cannot point you towards a specific ticket to watch for updates.

ravi@quantumgraph.com,

The behavior you are observing appears to be related to cache eviction. To help our investigation, would you please open a new ticket and attach an archive of $dbpath/diagnostic.data directory to it?

Thank you,
Thomas
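
As a rough way to look at cache pressure before opening a ticket, the sketch below computes how full and how dirty the WiredTiger cache is. It assumes the `wiredTiger.cache` section of a `serverStatus` document and its counters `bytes currently in the cache`, `tracked dirty bytes in the cache` and `maximum bytes configured`. The function name and the sample values are hypothetical, and this is not the analysis performed on the diagnostic.data in this ticket.

```python
from typing import Any, Dict, Mapping


def cache_pressure(server_status: Mapping[str, Any]) -> Dict[str, float]:
    """Return the used and dirty fractions of the configured WiredTiger cache."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    return {
        "used_fraction": cache["bytes currently in the cache"] / max_bytes,
        "dirty_fraction": cache["tracked dirty bytes in the cache"] / max_bytes,
    }


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
sample_status = {
    "wiredTiger": {
        "cache": {
            "maximum bytes configured": 1000,
            "bytes currently in the cache": 800,
            "tracked dirty bytes in the cache": 50,
        }
    }
}

print(cache_pressure(sample_status))
```

A used fraction that stays close to 1.0 during the slow period would be consistent with eviction struggling to keep up, but the diagnostic.data archive remains the authoritative source for that kind of investigation.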

Comment by Teemu Sirkiä [ 24/Aug/16 ]

It is kind of nice to hear that I'm not the only one having this issue and that Ravi can confirm it too.

The update queries might be the key to the issue, as most of the queries in my application are updates. Inserts and deletes are much rarer. Data is of course fetched too, but not as often as it is updated.

-Teemu

Comment by Ravi Teja [ 24/Aug/16 ]

Hi All,
We too faced the same issue.
We were initially using 3.2.6 and our CPU usage was not more than 30-40%. After updating to 3.2.9, within an hour of startup it was almost always at 100%. We downgraded back to 3.2.6 after a day of this issue. In the meantime, restarting the mongo machines in the replica set reduced the CPU utilization. We observed that a large number of update queries is the trigger for the issue.

Our CPU utilization is shown below on an hourly basis for each day; the random spikes at the end are related to upgrading mongo.

http://imgur.com/a/bpWt2

Thanks
Ravi

Comment by Teemu Sirkiä [ 23/Aug/16 ]

Hi!

I appreciate your work. I just uploaded the relevant Mongo logs.

Typically, top shows CPU usage between 0.3% and 2%. After these issues started, I wasn't able to log in to the server anymore, but the virtualization monitor reported that the virtual machine running Mongo took 30% of the whole processor capacity. So I suspect that the CPU usage at that point was 100%.

As I wasn't able to log in, I cannot be sure that it was Mongo's process that was using the CPU. However, the problem is still very tightly related to this update because the problems disappeared after downgrading. My system is running Node.js and using Mongoose to communicate with the database. I searched the Mongoose and MongoDB driver bug trackers but found nothing interesting.

I would be happy to be able to replicate this in my testing environment. If it occurred there, I could incrementally upgrade Mongo towards 3.2.9 to see which version actually causes the problems. Unfortunately, using the production server for debugging purposes is not a very feasible option.

-Teemu

Comment by Kelsey Schubert [ 23/Aug/16 ]

Hi ttsirkia,

Thanks for providing the diagnostic.data. Unfortunately, we have not been able to determine the cause of this behavior yet. To continue to investigate, would you please upload the complete logs to the same portal covering the time period that this issue occurred?

Additionally, would you please clarify how much cpu is typically utilized by the mongod instance?

Thanks again,
Thomas

Comment by Teemu Sirkiä [ 21/Aug/16 ]

Some additional information. The first issue occurred only about 10 hours after I updated to 3.2.9, and the second occurred 24 hours after that. Now, after downgrading back to 3.2.5, the system has been running without any problems for over 60 hours.

At the same time, I've tried to replicate the issue in my testing environment using Mongo 3.2.9. However, no problems have occurred. The background processes are the same in the testing environment as in the production environment. The main difference is that the number of queries is much smaller because there are no users using the system. This might indicate some kind of race condition that occurs at a certain point when the database is used both by the background processes and by the users via the web interface.

I hope the diagnostics files bring up something interesting. Meanwhile, I'm running the older version and am not going to upgrade.

Comment by Teemu Sirkiä [ 19/Aug/16 ]

Thanks! I uploaded all the files. The first issue occurred yesterday, Aug 18th, around 09:40:15 EEST and the second today, Aug 19th, around 11:03:35 EEST. Mongo was updated to version 3.2.9 on Aug 17th around 23:00 EEST and then downgraded back to 3.2.5 today after the issue.

BR,
Teemu

Comment by Kelsey Schubert [ 19/Aug/16 ]

Hi ttsirkia,

Providing the complete directory is preferable as it would give us more context around the issue. I've created a secure upload portal for you to use - would you please upload the diagnostic.data there?

Thank you for your help,
Thomas

Comment by Teemu Sirkiä [ 19/Aug/16 ]

Sure! It is 98 MB. Is it OK to drop some of the older files?

BR,
Teemu

Comment by Kelsey Schubert [ 19/Aug/16 ]

Hi ttsirkia,

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory and attach it to this ticket, so we can continue to investigate this issue?

Thank you,
Thomas
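
The archiving step requested above can be sketched with Python's standard `tarfile` module. This is only one way to do it, and a plain `tar` or `zip` command works just as well. The function name and the output file name are assumptions made for illustration, and the dbpath must be replaced with the one your mongod actually uses.

```python
import tarfile
from pathlib import Path


def archive_diagnostic_data(dbpath: str, out_file: str = "diagnostic.data.tar.gz") -> Path:
    """Pack <dbpath>/diagnostic.data into a gzip-compressed tar archive."""
    source = Path(dbpath) / "diagnostic.data"
    if not source.is_dir():
        raise FileNotFoundError(f"no diagnostic.data directory under {dbpath}")
    target = Path(out_file)
    with tarfile.open(target, "w:gz") as tar:
        # arcname roots the archive at "diagnostic.data" instead of the full path.
        tar.add(source, arcname="diagnostic.data")
    return target
```

For example, `archive_diagnostic_data("/var/lib/mongodb")` would write `diagnostic.data.tar.gz` to the current directory, assuming `/var/lib/mongodb` is the dbpath on that host.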

Comment by Teemu Sirkiä [ 19/Aug/16 ]

My environment is Linux 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Generated at Thu Feb 08 04:09:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.