[SERVER-27596] WiredTiger very high CPU usage Created: 06/Jan/17  Updated: 30/Jan/17  Resolved: 30/Jan/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Vincent van Megen Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian 7, xeon d1521, 32gb ram, 2x480gb ssd, 1gbit networking


Attachments: PNG File Screen Shot 2017-01-06 at 11.59.28.png     PNG File Screen Shot 2017-01-09 at 13.15.39.png     PNG File Screen Shot 2017-01-09 at 13.15.49.png     PNG File Screen Shot 2017-01-09 at 13.44.14.png    
Participants:

 Description   

I'm running a 3-member replica set. The average workload is very steady and does not push the load average above 2-3. When the workload increases (for example due to a delay in our processing queue), MongoDB starts processing queries very fast, but after about 5-10 minutes it starts hogging userspace CPU completely (100%, load average 120-130).

Even when the workload is shut down completely, the load average stays at 120-130. The only way to get it back to normal is to run rs.stepDown() and force another member to become primary. That member then (usually) starts processing the requests very fast, but sometimes it also goes back to the very high load average.
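
For reference, a minimal mongo shell sketch of the step-down workaround described above (the 60-second step-down window is an assumed example value, not the one actually used):

    // Run on the current primary; it will refuse re-election for the
    // given number of seconds so another member can take over.
    rs.stepDown(60)

    // Afterwards, check which member became primary.
    rs.status().members.forEach(function (m) {
        print(m.name + ": " + m.stateStr);
    });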

In the MongoDB log I can see that write requests are taking a very long time (some up to 10 seconds).
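
As an illustration (not taken from the original report), operations this slow can also be inspected live from the mongo shell while the load is high; the 5-second threshold below is an arbitrary example value:

    // List write operations that have been running for at least 5 seconds.
    db.currentOp({
        "secs_running": { $gte: 5 },
        "op": { $in: ["insert", "update", "remove"] }
    })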

I'm not sure how to prevent this, as it completely stalls my whole workload.



 Comments   
Comment by Kelsey Schubert [ 30/Jan/17 ]

Hi vincentvm,

It appears that your workload is hitting hardware limits. The workload increase is significant (possibly as high as 10x as many operations at its peak), and it's likely that the system's I/O was initially constrained as the workload spiked. As throughput slows, more work continues to arrive from the application but cannot be cleared as quickly, causing the WiredTiger cache eviction threads to hit the CPU limit. My advice would be to stabilize your workload or provision a more powerful host to resolve this issue.
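
A rough way to observe this kind of cache eviction pressure from the mongo shell is via the WiredTiger cache statistics in serverStatus(); the sketch below assumes the 3.2-era stat names, and high eviction by application threads is the signal to watch for:

    // Snapshot of WiredTiger cache pressure (stat names as exposed in 3.2).
    var c = db.serverStatus().wiredTiger.cache;
    print("bytes currently in the cache:         " + c["bytes currently in the cache"]);
    print("maximum bytes configured:             " + c["maximum bytes configured"]);
    print("pages evicted by application threads: " + c["pages evicted by application threads"]);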

Kind regards,
Thomas

Comment by Vincent van Megen [ 09/Jan/17 ]

Another screenshot of MongoDB Compass showing the number of queued writes increasing.
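
The same queueing is also visible from the shell in serverStatus(); a small sketch, assuming the default output field names:

    // Number of operations currently waiting in the global lock queue.
    var q = db.serverStatus().globalLock.currentQueue;
    print("queued readers: " + q.readers + ", queued writers: " + q.writers);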

Comment by Vincent van Megen [ 09/Jan/17 ]

Two screenshots of cloud.mongodb.com statistics taken while this issue is happening.

Comment by Vincent van Megen [ 06/Jan/17 ]

I don't have the logs anymore, but I uploaded the diagnostic.data directory. It's in 3 folders (0, 1, 2); directory 2 was the primary server. It started happening at around 9 AM GMT+1 and lasted until about 11 AM GMT+1 on the 6th of January.

Comment by Kelsey Schubert [ 06/Jan/17 ]

Hi vincentvm,

If you could upload the complete directory, it may provide us with additional context about your typical workload. I've created a secure portal for you to upload the diagnostic.data as well as the complete mongod logs for each node.

Thanks again,
Thomas

Comment by Vincent van Megen [ 06/Jan/17 ]

I have the diagnostic.data directory; where can I upload these files? Should I just upload the diagnostic files from the day this happened?

Comment by Kelsey Schubert [ 06/Jan/17 ]

Hi vincentvm,

Thanks for reporting this behavior. So we can continue to investigate, would you please provide an archive of the diagnostic.data and complete logs for each node in the replica set?

Thank you,
Thomas
