[SERVER-21689] 100% CPU usage when rapidly creating/dropping collections Created: 30/Nov/15  Updated: 08/Feb/16  Resolved: 05/Feb/16

| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tomasz Chmielewski | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |

| Description |
Our mongo installation is experiencing a weird issue: after starting mongod, CPU usage on one core starts growing until it hits 100% (on just one core, with the other CPU cores mostly idle). Restarting mongod on the affected server "fixes" the issue - for a few hours CPU usage is very small on all cores, then it gradually grows to 100% on one core. After a few days mongod is barely responsive; even simple operations like "show dbs" take 10 seconds or more. The issue is not related to excessive disk IO or network traffic (both remain very small). When we run "perf top -g", it mostly shows the following:

Related mongodb-user discussion: https://groups.google.com/forum/#!topic/mongodb-user/K2WkN2dOKMA

Let me know what other information you need.
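For reference, a profile like the one mentioned above could be captured along these lines; the pid lookup and sample window are assumptions, not the reporter's exact commands:

    # Live call-graph view of the busy mongod process (run as root)
    sudo perf top -g -p "$(pidof mongod)"

    # Or record a 30-second window for offline analysis
    sudo perf record -g -p "$(pidof mongod)" -- sleep 30
    sudo perf report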
| Comments |
| Comment by Alexander Gorrod [ 05/Feb/16 ] |

Thanks for confirming tch@virtall.com. I'm going to close this ticket, since we understand the reason for the symptom was creating and dropping many collections.
| Comment by Tomasz Chmielewski [ 12/Jan/16 ] |

Since we stopped rapidly creating/dropping collections (many times per second in a health check), we're not seeing this issue anymore.
| Comment by Ramon Fernandez Marina [ 11/Jan/16 ] |

Apologies for the radio silence, tch@virtall.com. I've created an upload portal where you can upload files privately and securely. Thanks,
| Comment by Tomasz Chmielewski [ 23/Dec/15 ] |

We are using 3.2.0, so in theory we shouldn't be affected? As for collection drops - I see our application health check script does that from many servers (it creates a new collection, then drops it), which is perhaps not a good idea, so I will ask our developers to change it. pstack - we won't be able to collect it in any useful manner, as we now actively monitor "show dbs" reply times, and if it takes more than a second, we restart the given primary. As for log files - can we upload them somewhere privately? We log slow queries, and they could expose sensitive data.
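A minimal sketch of the health-check pattern described above, plus a lighter alternative; the database and collection names are hypothetical:

    # Problematic probe: every run creates and drops a collection,
    # which churns the storage-engine catalog on each check
    mongo --quiet --eval 'var d = db.getSiblingDB("healthcheck");
        d.probe.insert({ts: new Date()}); d.probe.drop();'

    # Lighter probe: a ping round-trip checks liveness without touching the catalog
    mongo --quiet --eval 'printjson(db.adminCommand({ping: 1}))'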
| Comment by Bruce Lucas (Inactive) [ 22/Dec/15 ] |

Hi Tomasz, one more question - the stats show a healthy rate of collection drops, about 3 per second on average. This makes me wonder if you might be seeing an impact related to
| Comment by Bruce Lucas (Inactive) [ 22/Dec/15 ] |

Thanks Tomasz, we are analyzing those files. Could you also upload mongod log files covering those incidents? Complete log files including at least a few hours before the incident and some time after would be ideal. Also, would you be in a position to use pstack to collect stack traces the next time this happens? Ideally we'd like to see several (as many as 10 if possible) pstack samples at, say, 1-minute intervals, to be sure we're seeing typical behavior (although I understand you may not want to leave the server in the low-performing state for a long time). Thanks,
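A sketch of collecting the requested samples; the pid lookup and output file names are assumptions:

    # Grab ~10 pstack samples of mongod, one minute apart
    PID=$(pidof mongod)
    for i in $(seq 1 10); do
        pstack "$PID" > "mongod.pstack.$i.$(date +%H%M%S)"
        sleep 60
    done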
| Comment by Tomasz Chmielewski [ 22/Dec/15 ] |

The files are quite large, so I've uploaded them here: http://www.virtall.com/files/temp/mongo/2015-12-22/ It contains diagnostic.data for all three servers comprising the primary shard. repl01b - had an issue (extremely slow replies as a primary replica server) on 21 Dec 2015, and was then restarted
| Comment by Asya Kamsky [ 22/Dec/15 ] |

Would you be able to upload the diagnostic.data files to this ticket? Their contents are roughly equivalent to what you would get from db.serverStatus() metrics. It would also help to know the exact time of the drop in performance, to correlate it with the metrics in the files.
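For reference, the live counterpart of those metrics can be pulled from the shell; a minimal sketch:

    # serverStatus() returns the same counters that diagnostic.data samples over time
    mongo --quiet --eval 'printjson(db.serverStatus().metrics)'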
| Comment by Tomasz Chmielewski [ 22/Dec/15 ] |

Also, on 3.2.0 we are still having replication delay issues. So while "100% CPU usage on one core" is no longer the issue, other symptoms are still there (delay in replication, mongo getting slower and slower over time, up to the point where it's no longer usable).
| Comment by Tomasz Chmielewski [ 21/Dec/15 ] |

Unfortunately we had a "similar" issue again, with 3.2.0. The symptoms were similar (mongo very, very slow to reply); however, there was no 100% CPU usage on any of the cores (plus, system load was low, around 0.2, with barely any IO). Like before, it was the primary shard. There were no errors in the mongo log; however, the queries recorded there seem to be taking longer and longer over time. The problem was that even the most basic queries, like "show dbs", were taking ages, for example:
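A sketch of how such response times could be measured; listDatabases is the command behind "show dbs", and this invocation is illustrative rather than the reporter's exact script:

    # Time a listDatabases round-trip against the primary
    time mongo --quiet --eval 'printjson(db.adminCommand({listDatabases: 1}))'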
Is there any command to read the information from the "diagnostic.data" directory?
| Comment by Asya Kamsky [ 15/Dec/15 ] |

tch@virtall.com thanks for that info. When this problem manifests, 3.2.0 automatically collects the diagnostic information we had you gather manually for 3.0. Can you keep an eye on the 3.2.0 instance, and if you see the same symptom on it, upload the contents of the data directory subdirectory called diagnostic.data? Asya
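A sketch of bundling that directory for upload; the dbPath shown is an assumption and may differ on these servers:

    # Package the FTDC samples from the default dbPath for upload
    tar czf "diagnostic-$(hostname).tar.gz" -C /var/lib/mongodb diagnostic.data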
| Comment by Tomasz Chmielewski [ 13/Dec/15 ] |

FYI, this issue does not show up after upgrading to Mongo 3.2.0.
| Comment by Tomasz Chmielewski [ 09/Dec/15 ] |

The requested data can be downloaded below (it was way too large to attach here): http://www.virtall.com/files/temp/mongo/ I've also included "sar" output:

At this time, after around 1 day, replication delay is as follows:

If you need any other information, let me know.
| Comment by Tomasz Chmielewski [ 08/Dec/15 ] |

Update/observation: after about a day, secondaries with WiredTiger use 100% CPU on one core and replication delay starts increasing, i.e.:

iostat shows little IO on them. After restarting mongod, iostat shows around 80-90% busy IO until the replication delay goes down to 0; then IO is low. Will start collecting stats on both of them from this moment, until we get to 100% CPU on one core and growing replication delay.
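One way the replication delay described above could be watched from the shell; rs.printSlaveReplicationInfo() is the helper available in the 3.x shell:

    # Print each secondary's lag behind the primary's oplog
    mongo --quiet --eval 'rs.printSlaveReplicationInfo()'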
| Comment by Tomasz Chmielewski [ 07/Dec/15 ] |

Requested stats - will provide later this week.
| Comment by Tomasz Chmielewski [ 07/Dec/15 ] |

We've run a couple of tests (with 3.0.7) and have the following conclusions:
| Comment by Daniel Pasette (Inactive) [ 06/Dec/15 ] |

Hi Tomasz,

Dan
| Comment by Tomasz Chmielewski [ 06/Dec/15 ] |

When I attach strace to a mongo process / thread using 100% CPU, it only outputs these at a rapid pace:
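A sketch of attaching to the hot thread rather than the whole process; the thread id shown is a placeholder:

    # Find the busiest thread of mongod, then attach strace to that thread id
    top -H -b -n 1 -p "$(pidof mongod)" | head -n 15
    strace -c -p 12345    # replace 12345 with the busy TID; -c summarizes syscall counts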
| Comment by Tomasz Chmielewski [ 05/Dec/15 ] |

Update: changed to 3.0.7 again; it only seems to affect the primary shard (there are 2 shards, each a 3-server replica set). I've moved the primary, and the 100% CPU usage on one core after a few hours of mongod process uptime is now showing up on the "new" primary shard (on all 3 servers in this replica set).
| Comment by Tomasz Chmielewski [ 01/Dec/15 ] |

Just to test it out, I downgraded from 3.0.7 to 3.0.6, but it shows the same behaviour.
| Comment by Tomasz Chmielewski [ 30/Nov/15 ] |

Observation: on an affected (secondary) server, after killing all mongo traffic on eth0 with iptables (-j REJECT), the mongod process is still using around 50% CPU. mongotop is not showing any reads or writes (which is expected, since network communication is dead for mongod). mongostat is still showing around 30 connections (probably TCP hasn't timed them out yet). After a mongod restart, with traffic still blocked by iptables, mongod CPU usage is 0% (expected, no reads/writes).
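A sketch of the kind of rule described above; the port and interface are the MongoDB defaults, assumed rather than confirmed:

    # Reject inbound client traffic to mongod on eth0 (undo later with -D in place of -A)
    iptables -A INPUT -i eth0 -p tcp --dport 27017 -j REJECT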
| Comment by Tomasz Chmielewski [ 30/Nov/15 ] |

More info: