[SERVER-19485] Cpu 100% when mongodb running with normal ops Created: 20/Jul/15  Updated: 26/Sep/15  Resolved: 26/Sep/15

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: carl dong Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 2015-07-27.rar     JPEG File CPU Screenshot.jpg     File DB.rar     File DBlog0728.rar     JPEG File Detail_Cpu_ScreenShot2.jpg     JPEG File Detail_Cpu_screenshot.jpg     JPEG File OpsManager_Screenshot.jpg     PNG File cpu.png     PNG File cpu_usage.png     File llorenzen_serverStatus.json     File logs.rar     JPEG File perf top sc.jpg     Text File serverStatus.txt     Text File serverstatus.txt    
Issue Links:
Related
is related to SERVER-19064 Unpredictable High CPU usage not rela... Closed
Operating System: ALL
Participants:

 Description   

Hi Team ,

There is a mongodb session that uses 100% CPU even though there are only normal ops on the instance. The issue can be resolved by restarting the database, but it reappears after a few weeks.
It is always the same PID in Linux that uses 100% CPU; I have attached some screenshots taken while running the top command.

I also see the message "thread over memory limit, cleaning up" very often in the db log. Is there an issue with my database?

Config file options:

processManagement:
  fork: true
  pidFilePath: /home/Radius/Rad.pid

systemLog:
  destination: file
  path: /home/Radius/Log/Rad_Log
  logAppend: true

storage:
  dbPath: /home/Radius/Data_WiredTiger
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 20
  directoryPerDB: true

net:
  port: 40001
  maxIncomingConnections: 2000

security:
  authorization: enabled
  keyFile: /home/Radius/Keys/RadKey

replication:
  replSetName: rad

auditLog:
  destination: file
  format: JSON
  path: /home/Radius/Log/RadiusDBAduit.json
  filter: '{ atype : {$nin : ["authenticate"]} }'



 Comments   
Comment by Ramon Fernandez Marina [ 26/Sep/15 ]

Closing this ticket after the 3.0.6 release has been out for a while and no new reports of this behavior have come in.

Comment by Ramon Fernandez Marina [ 31/Aug/15 ]

huy, can you please open a separate ticket? Even if you're seeing the same symptoms, the root cause may be different. Please provide full details about your deployment (replica set configuration, amount of memory, number of CPUs, etc.) and full logs for the affected server. The first order of business will probably be to collect server statistics for a period of time as described above.

Thanks,
Ramón.

Comment by Huy [ 31/Aug/15 ]

Hi everybody,

I'm following this issue because I'm facing the same problem (mongo 3.0.6).
My server status from when it happened is attached.

Regards

Huy.

Comment by Ramon Fernandez Marina [ 27/Aug/15 ]

Thanks carl.dong@windfindtech.com. We'll keep this ticket in "Waiting for User Input" until we hear back from you.

Regards,
Ramón.

Comment by carl dong [ 27/Aug/15 ]

Ramon, I have updated my database to MongoDB 3.0.6. I will monitor it for a few days and keep this ticket updated.

Comment by Ramon Fernandez Marina [ 25/Aug/15 ]

carl.dong@windfindtech.com, as per michael.cahill's comment above, SERVER-19522 was fixed in MongoDB 3.0.6, which was released earlier this week. Can you please try 3.0.6 and see if it addresses your issue?

Thanks,
Ramón.

Comment by Michael Cahill (Inactive) [ 29/Jul/15 ]

carl.dong@windfindtech.com, we have been looking through the data, and the best theory we have at the moment is that you might be hitting SERVER-19522, which has a fix in progress.

The symptom of that ticket is that truncating the oplog to keep the size capped can leave behind some empty pages in the cache. Over time, the number of pages can grow large and when a checkpoint runs after the oplog is truncated, it has to step over all of those empty pages before getting useful work done.
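
In the meantime, one way to check whether the CPU spikes line up with checkpoints is to log the checkpoint counters from serverStatus() alongside the existing capture; a rough sketch (assuming the counters appear under serverStatus().wiredTiger.transaction on your build, and adjusting host/port/credentials as needed):

mongo --eval "while(true) {var t = db.serverStatus().wiredTiger.transaction; print(new Date().toISOString() + ' running=' + t['transaction checkpoint currently running'] + ' last_ms=' + t['transaction checkpoint most recent time (msecs)']); sleep(1000)}" >checkpoints.log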

I will let you know when the issue is resolved: we are currently testing the fix.

Comment by carl dong [ 28/Jul/15 ]

Bruce ,

1. There is no regular periodic application activity on the database.
2. I believe it happens much more often under heavy load, and it should be some kind of accumulation that triggers the issue, because after I restarted the database it has not happened for many days under the same level of load.
3. I can only provide the level 2 log from while the issue was happening in the database; there is no information in /var/log/messages or /var/log/syslog when the issue happens.
4. There are 40 CPUs on the server, and the 100% CPU means only 1 CPU is 100% used (running __wt_tree_walk); the others are in a normal state.

Thanks

Carl

Comment by Bruce Lucas (Inactive) [ 27/Jul/15 ]

Thanks Carl. From that data we can see that the high CPU utilization precisely correlates with the checkpoints. This suggests that the checkpoint activity is either directly or indirectly responsible for the high CPU utilization. (I've attached a graph showing the correlation and some other correlated statistics that we're using for an internal discussion.)

The regularity of the occurrence of the issue is striking. I have two theories about that:

  • Some regular periodic application activity triggers the issue; do you have any such periodic activity?
  • However, the interval between occurrences, while regular, seems to be somewhat variable, and possibly load dependent - for example, you mention that during busy times it occurs every 10 minutes, whereas in the data you uploaded it happened at 20 minute intervals. Do you believe it happens more often under heavy load? This suggests an accumulation of some kind of state that is then periodically cleaned up in some way.

Would you be able to upload the exact mongod log corresponding to the latest data you uploaded? This will allow us to look for events that are exactly correlated with the CPU issue. Also if you could upload the syslog file (/var/log/messages or /var/log/syslog) corresponding to the same time period we could look for correlated events at the system level.

Finally the CPU percentage reported by iostat was at most about 5.8%, so I assume that the 100% CPU utilization that you mention was 100% of a single CPU on a multi-CPU system (whereas iostat is reporting % of total CPU resources). Since WT in general is able to make effective use of multiple CPUs, while 100% utilization of a single CPU is something to be investigated, I wouldn't expect it to necessarily cause any issues in a multi-CPU system. Are you observing any application-level issues as a result of the 100% CPU utilization? If so that might tell us more about the nature of the 100% CPU issue.
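
If you want to confirm that it is a single thread rather than overall load, a quick per-thread check (a sketch, assuming Linux and a single mongod process on the host) is:

top -b -H -n 1 -p $(pidof mongod) | head -40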

Thanks,
Bruce

Comment by Mike Bartlett [ 27/Jul/15 ]

Michael, the main problem that we've had is that we've only experienced this issue when the WiredTiger replica members are primary. We've now stepped them down to secondary, so I'm not sure if we're going to be able to get any more data, as we're loath to step the machines back up again: the problem brings our entire app to its knees when it occurs.

We will see what we can do. Do you have any specific references to issues fixed in 3.0.5 and is it still on track for release tomorrow?

Cheers
Mike
(PS. How do you mention people in Jira?)

Comment by carl dong [ 27/Jul/15 ]

Usually the issue lasts for 2-3 minutes; during busy times, it happens every 10 minutes.

I just captured 3 occurrences of the issue with serverStatus and iostat; time periods:
02:18:24~02:21:26 , 02:38:20~02:40:40 , 02:57:48~02:59:48
Log files attached.

By the way, I'm going to upgrade to 3.0.5 as soon as a stable release with the capped collection performance fixes is available, to see if the problem is related to the capped collections.

Comment by Ramon Fernandez Marina [ 24/Jul/15 ]

Thanks for posting that information carl.dong@windfindtech.com. Based on the times you mention the CPU utilization seems to be correlated with some unusually long checkpoints, but we would need more precise timing information to tell whether it is victim or culprit. Would it be possible for you to repeat the serverStatus timeseries data collection, but also add iostat, as follows:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus())); sleep(1000)}" >ss.log
iostat -k -t -x 1 >iostat.log &

Note I have adjusted the delay in the serverStatus timeseries to 1s, matching the 1s resolution of the iostat data. If you could do this a couple of times, or leave it running long enough to capture a couple of incidents, that will help us pin down the correlation between CPU utilization and events inside mongod.

Also, can you tell us what the general pattern of 100% CPU utilization is - how long does the issue last? How frequently does it occur (once it has started occurring after the few weeks)?

By the way, I should note that in the serverStatus excerpt you posted earlier the oplog is full, as the oplog size limit is based on "size", not "storageSize", and size==maxSize in that excerpt. Given that, do you think it is possible that the problem begins to occur after the oplog becomes full?
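
For reference, a quick way to check that (a one-liner sketch; adjust host/port and credentials as needed given authorization is enabled) is:

mongo local --eval "var s = db.oplog.rs.stats(1024*1024); print('size=' + s.size + ' maxSize=' + s.maxSize + ' full=' + (s.size >= s.maxSize))"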

Thanks,
Ramón (on Bruce's behalf)

Comment by carl dong [ 24/Jul/15 ]

ss.log is for the issue that happened around 02:24:34.
ss1.log is for the issue that happened around 02:41:50.

Comment by Michael Cahill (Inactive) [ 24/Jul/15 ]

Thanks for the additional information mydigitalself, lalo and carl.dong@windfindtech.com. At the moment we can't be completely sure that these three cases all have the same underlying cause but we are making that assumption. Getting "perf top" output when the system gets into this state would help to confirm this.

There are some differences: we note that in lalo's case the system is using all available memory, whereas in the other cases there appears to be free memory available. When there is memory pressure, some data may be evicted from the RAM managed by Linux (even if swap is disabled). That may partially explain the additional reads. There are a number of fixes coming in 3.0.5 for issues where MongoDB with WiredTiger could use more memory than expected.

Are any of you able to gather server stats over a period of time (e.g., 5 minutes) when the system is in this state, then again (for another 5 minutes) after it is restarted and warmed up? That should help us pinpoint the differences: at the moment we are unable to reproduce this behavior and while we have some theories, we can't find a clear explanation.

Here is the recipe for gathering statistics every 10 seconds:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus())); sleep(10000)}" >ss.log

Comment by Mike Bartlett [ 23/Jul/15 ]

Hi folks,

We (Gitter) have experienced exactly the same issue twice now, so I thought I'd chime in with as much information as I can provide.

Replica setup:
1x arbiter
2x mongo 3.0.4 (WiredTiger/Snappy) (mongo-replica-member-001 & mongo-replica-member-005)
1x mongo 2.6 (mongo-replica-member-002)

We kept the 2.6 box specifically for just-in-case scenarios, as we only recently upgraded to WiredTiger.

The mongo 3.0.4 boxes are both r3.large AWS instances sitting on top of PIOPS (2000) SSD drives. The 2.6 box is also an r3.large, but is on a RAID-10 array. So I know there are some differences in the infrastructure, but we are trying to move everything to single PIOPS SSD drives as they are much easier to manage.

Yesterday, mongo-replica-member-005 (WiredTiger) was primary - we prefer the WiredTiger boxes to be primary, as their performance is much better - well done! - although we are starting to reconsider this given that this issue has now happened twice in production and brings everything to its knees.

Unfortunately I don't have a db.serverStatus() saved from when it happened. What we did notice is that the session open cursor count the first time we saw this issue was ~60k, which seems extraordinarily high to us. Below you can see a paste of db.serverStatus() output. Notice the increase across a minute.

TroupeReplicaSet:PRIMARY> db.serverStatus().wiredTiger.session

{ "open cursor count" : 32797, "open session count" : 144 }

TroupeReplicaSet:PRIMARY> db.serverStatus().wiredTiger.session

{ "open cursor count" : 32809, "open session count" : 144 }

and again this morning (~12 hours later)

TroupeReplicaSet:SECONDARY> db.serverStatus().wiredTiger.session

{ "open cursor count" : 46177, "open session count" : 163 }

This may be a red herring.
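
If it helps, a simple loop to track whether that count keeps climbing over time (a rough sketch; adjust host/port and credentials as needed) would be:

mongo --eval "while(true) {var s = db.serverStatus().wiredTiger.session; print(new Date().toISOString() + ' cursors=' + s['open cursor count'] + ' sessions=' + s['open session count']); sleep(60000)}" >sessions.log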

The only information I can provide is a screenshot from DataDog.
https://infinit.io/_/W486hu4

There are a few things to notice here:

1. Sudden increased CPU usage

One observation: watching the box with "top", I could see the mongod process running at ~107%; normally it runs at around 7%. Given that the box has two cores, our understanding is that perhaps one thread was operating at 100% and the other was ticking along as normal at 7%. You can sort of see this in the CPU usage graph, in that it shoots up to slightly above 50% (i.e. of 2 cores).

You can also notice there is a gradual increase for a few hours prior to the large spike.

2. Memory breakdown

Corresponding to the gradual increase in CPU before the spike, you can see the memory usage race up - it effectively doubles. We have a vague recollection of reading something about WiredTiger not consuming all of its memory at first, but starting at around 50% and then increasing this on demand. We could be incorrect, but perhaps what's happening here is that more memory is allocated and then something funny (yes, that is a technical term) starts to happen after a while.

3. OPS

The lower right graph shows consumed OPS across all volumes. It's hard to tell as the purple line covers it a little, but corresponding to the initial increase in CPU and the memory jump, the SSD (blue line) takes a small jump for a sustained period and then a huge jump during the CPU spike. You can see on AWS that this is because of read operations: https://infinit.io/_/dn3Qn7t

==

The drops in the CPU spike and OPS were all caused by us stepping the box down as primary, and after a few minutes everything returned to normal without restarting mongod.

One final observation...

lsof reports that mongo has an incredibly high number of open files (at all times)...

root@mongo-replica-member-005:/home/deployer# lsof | grep data2 | wc -l
354316

And our ulimit in mongodb.conf is only 100000....
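
Side note: the limits that actually apply to the running process can be read directly (a quick check, assuming Linux):

cat /proc/$(pidof mongod)/limits | grep -i 'open files'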

==

I hope that helps. We're loath to put the box back as a primary until we have more insight on this, so I'm not sure if it will happen again.

Question for Lars Lorenzen: you mentioned you saw this on a slave (secondary, I assume?) - do you perform read operations against this slave at all (slaveOK)? We don't, so I'm not sure it will exhibit the same behaviour.

If it does happen again, we will certainly save the db.serverStatus and the results of perf top.

Any insight that anyone could provide would be great. We would be happy to field calls and work interactively with anyone at Mongo to help pin down this issue, as we are all on edge right now.

Cheers
Mike

Comment by Lars Lorenzen [ 20/Jul/15 ]

I can provide the log, which is overflowing with slow query warnings. This happened on 2015-07-16 20:00 CEST.
I did not take screenshots of the CPU usage.
I will collect information the next time it happens.
We are running mongodb 3.0.4 on CentOS with ext4 on SSD.

Comment by Daniel Pasette (Inactive) [ 20/Jul/15 ]

hi lalo, are you able to run the same diagnostic steps as Carl has done previously?

Thanks,
Dan

Comment by Lars Lorenzen [ 20/Jul/15 ]

We are having the same problem on our slave running mongod 3.0.4: one process is using 100% CPU and slowing down the whole database (read performance is very poor). A restart helps.

Comment by carl dong [ 20/Jul/15 ]

serverStatus attached.
There are only normal ops on the instance; after restarting mongod the issue is gone. I did this before, but the issue comes back within a few weeks.

Comment by Michael Cahill (Inactive) [ 20/Jul/15 ]

Thanks carl.dong@windfindtech.com, that makes things a little clearer. That thread is the one that finds pages to evict from the WiredTiger cache. It is expected to run frequently, but unless there are a lot of reads, I would not expect it to use 100% of a CPU core.

Can you please also run db.serverStatus() and post the full output here?
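
An easy way to capture the full output without terminal truncation (adjust host/port/credentials as needed) is:

mongo --eval "printjson(db.serverStatus())" >serverStatus.txt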

Comment by carl dong [ 20/Jul/15 ]

Also, the oplog.rs collection is not full in my database.

> db.oplog.rs.stats(1024*1024)
{
"ns" : "local.oplog.rs",
"count" : 188920828,
"size" : 51200,
"avgObjSize" : 284,
"storageSize" : 21962,
"capped" : true,
"max" : -1,
"maxSize" : 51200,

Comment by carl dong [ 20/Jul/15 ]

It is in __wt_tree_walk, as you can see in the newly updated screenshot.

Comment by Michael Cahill (Inactive) [ 20/Jul/15 ]

carl.dong@windfindtech.com, sorry you are having this problem with MongoDB.

I have looked through the log, but unfortunately I cannot see the cause from there. I don't think the message "thread over memory limit, cleaning up" is serious: it is just an indication that memory is being freed from a thread-local cache to be available to other threads. That happens any time a mongod thread becomes idle with more than 64KB memory in its thread local cache.

One way to be sure about what is happening would be to run "perf top -g" once mongod gets into this state and send us a screenshot of the busy functions.
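
If an interactive screenshot is awkward, a non-interactive capture works as well (a sketch, assuming perf is installed and profiling is permitted on the host):

perf record -g -p $(pidof mongod) -- sleep 60
perf report --stdio >perf-report.txt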

Without profiling information, my best guess is that you may be seeing poor behavior of the thread that truncates from the oplog to keep the size capped. Tickets SERVER-18875 and SERVER-19178 dealt with these problems: they will be fixed in the upcoming 3.0.5 release.

Comment by carl dong [ 20/Jul/15 ]

The Affects Version should be: 3.0.4 Enterprise.
