[SERVER-56250] High CPU usage on Mongo server, but Mongo seems to be idle Created: 21/Apr/21  Updated: 27/Oct/23  Resolved: 16/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sasa Trifunovic Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: Performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PDF File 15-16-21-04-2021_serverStatus_high_average_load_high_mongo_load.pdf     PDF File 15-20-21-04-2021_serverStatus_high_average_load_low_mongo_load.pdf     PNG File instance_metrics_1.png     PNG File instance_metrics_2.png     HTML File jobHolders     HTML File jobRecord     PNG File mongo_exporter_1.png     PNG File mongo_exporter_2.png     PNG File mongostat.png     PNG File mongotop.png    
Operating System: ALL
Participants:

 Description   

We are running MongoDB 4.2.13 as a replica set with a primary and two secondaries. The servers have 4 CPUs and 16 GB of RAM (m5.xlarge on AWS) and are dedicated to Mongo only.
We are running Mongo with the default configuration, except for transactionLifetimeLimitSeconds, which is set to 900.
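For reference, a minimal sketch of how that parameter can be set at runtime (pymongo and the connection string are assumptions here; the same value can equally be set under setParameter in mongod.conf):

```python
# Sketch: set transactionLifetimeLimitSeconds at runtime.
# pymongo is assumed and the host list is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0,mongo-1,mongo-2/?replicaSet=rs0")
client.admin.command("setParameter", 1, transactionLifetimeLimitSeconds=900)
```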

During the load tests, we regularly run into situations where the primary gets stuck. We process messages from a message queue with 10 threads, and those threads insert the results into Mongo.
The load average becomes very high, around 9, yet according to mongotop and mongostat, Mongo does not appear to be performing any database operations at that time. iostat shows high values for the user and idle columns.
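To make the workload concrete, it is roughly of the following shape (a simplified sketch only; the queue contents, database name and document shape are placeholders, not our actual code):

```python
# Simplified sketch of the workload: 10 worker threads draining a queue
# and inserting each result into MongoDB.
import queue
import threading
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0,mongo-1,mongo-2/?replicaSet=rs0")
coll = client["jobs"]["jobRecord"]   # placeholder namespace

work = queue.Queue()

def worker():
    while True:
        msg = work.get()
        if msg is None:              # poison pill stops the thread
            break
        coll.insert_one(msg)         # one insert per consumed message
        work.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(10)]
for t in threads:
    t.start()
```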

We couldn't find any hints even with the profiler turned on. We have the necessary indexes in place, meaning we do not see any COLLSCAN in currentOp, and currentOp did not reveal anything else obviously abnormal either.
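For completeness, this is roughly how in-progress operations can be checked for collection scans (a sketch using pymongo's database-level aggregate with $currentOp, which needs pymongo 3.9+ and appropriate privileges; the host is a placeholder):

```python
# Sketch: list in-progress operations whose plan summary is a collection scan.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0/?replicaSet=rs0")
pipeline = [
    {"$currentOp": {"allUsers": True}},
    {"$match": {"planSummary": "COLLSCAN"}},
]
for op in client.admin.aggregate(pipeline):
    print(op.get("ns"), op.get("planSummary"), op.get("secs_running"))
```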

In the attachments you can find:

instance metrics, Mongo metrics, and mongostat/mongotop output from the problematic period (where the instance is under load but nothing shows up in Mongo),
serverStatus during the peak (high average load and high Mongo load), serverStatus after the peak (high average load and low Mongo load), and stats for two of our main collections, jobHolder and jobRecord.
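(For reference, the kind of data behind these attachments can be reproduced with something like the sketch below; pymongo and the "jobs" database name are assumptions.)

```python
# Sketch: gather serverStatus and per-collection stats comparable to the attached reports.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0/?replicaSet=rs0")

server_status = client.admin.command("serverStatus")
print(server_status["wiredTiger"]["cache"]["bytes currently in the cache"])

for name in ("jobHolder", "jobRecord"):
    stats = client["jobs"].command("collStats", name)
    print(name, stats["count"], stats["size"], stats["totalIndexSize"])
```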

If necessary, we can provide more information or perform additional tests and post the results.

We would be grateful for any advice on how to overcome this issue.



 Comments   
Comment by Dmitry Agranat [ 05/May/21 ]

Hi sasa.s.trifunovic@gmail.com

> Is it due to our test constantly hitting that eviction_dirty_trigger as soon as Mongo eviction threads clean up some of the dirty pages? Like some sort of seesaw stalemate?

This is precisely the case with this workload.

Comment by Sasa Trifunovic [ 05/May/21 ]

Hi Dmitry, 

Although I understand that hitting the configured eviction_dirty_trigger (default 20%) with dirty pages will result in Mongo trying to evict them, and therefore in a much higher average load, I don't understand why this average load never goes down unless we abort the load test.
Is it due to our test constantly hitting that eviction_dirty_trigger as soon as Mongo eviction threads clean up some of the dirty pages? Like some sort of seesaw stalemate?
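(For reference, the eviction thresholds involved can be adjusted at runtime via wiredTigerEngineRuntimeConfig; a sketch follows purely as an experiment aid, not as a recommended fix, with illustrative values.)

```python
# Sketch only, not a recommendation: make eviction start earlier by lowering
# eviction_dirty_target / eviction_dirty_trigger at runtime.
# Defaults are eviction_dirty_target=5 and eviction_dirty_trigger=20
# (percent of the WiredTiger cache); the values below are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0/?replicaSet=rs0")
client.admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig="eviction_dirty_target=2,eviction_dirty_trigger=10",
)
```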

Comment by Dmitry Agranat [ 02/May/21 ]

Hi sasa.s.trifunovic@gmail.com, thank you for uploading the requested information.

What is causing the rise in CPU utilization is eviction trying to clean up dirty pages. This is expected.

Looking at the workload, you are essentially overwhelming the cache by thrashing bytes in and out of it. For example, the dirty portion of the cache is configured at ~3 GB (on your 15 GB RAM server) before we decide to start evicting data, and during this workload all of that data has to be evicted every minute; this is where the eviction process kicks in, resulting in the elevated CPU.
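(As a rough back-of-the-envelope, assuming the default settings, the thresholds work out as sketched below; the exact figures depend on how the cache is configured on your instances, and the dirty volume actually flushed during an event can exceed the trigger threshold because new writes keep arriving while eviction runs.)

```python
# Back-of-the-envelope sketch of default WiredTiger cache/eviction numbers.
# Assumed defaults: cache = max(50% of (RAM - 1 GB), 256 MB),
# eviction_dirty_target = 5%, eviction_dirty_trigger = 20% of that cache.
ram_gb = 16  # m5.xlarge

cache_gb = max(0.5 * (ram_gb - 1), 0.25)
dirty_target_gb = 0.05 * cache_gb   # background eviction threads start here
dirty_trigger_gb = 0.20 * cache_gb  # application threads are drafted into eviction here

print(f"cache ~{cache_gb:.1f} GB, dirty target ~{dirty_target_gb:.2f} GB, "
      f"dirty trigger ~{dirty_trigger_gb:.1f} GB")
```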

I am not sure how much data you need to insert/update per second during this test or what the average page size is, but in general the way to avoid this situation is to throttle the workload, scale horizontally (sharding), or add more RAM (I am not sure how much, but given that you need to flush ~3 GB during the high-CPU event, I would start testing with at least 30 GB and move forward based on the results).
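(To illustrate the throttling option: one simple client-side approach is to have the writer threads back off while the dirty portion of the cache is above a chosen fraction. A sketch, with the threshold, sleep interval and connection string being placeholders:)

```python
# Sketch of client-side throttling: pause writers while the dirty portion of
# the WiredTiger cache is above a chosen fraction. Field names are taken from
# serverStatus.wiredTiger.cache; threshold and sleep interval are illustrative.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0/?replicaSet=rs0")

def wait_for_dirty_cache(max_dirty_fraction=0.10):
    while True:
        cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
        dirty = cache["tracked dirty bytes in the cache"]
        limit = cache["maximum bytes configured"]
        if dirty / limit < max_dirty_fraction:
            return
        time.sleep(0.5)  # back off and let eviction catch up

# Each writer thread would call wait_for_dirty_cache() before its next batch.
```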

Dima

Comment by Sasa Trifunovic [ 29/Apr/21 ]

Looking at mongostat during our load tests, it becomes apparent that mongod starts stalling when it hits 20% dirty pages. Judging by the MongoDB documentation this seems to be expected, but what seems very strange is that the percentage of dirty pages shown in mongostat never goes down afterwards, as if eviction never occurs.
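(The percentage mongostat reports can also be sampled directly from serverStatus to check whether eviction ever catches up; a small polling sketch, with pymongo and the host being assumptions:)

```python
# Sketch: log the dirty-cache percentage (the "dirty" column in mongostat) and
# the overall cache usage by polling serverStatus once per second.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-0/?replicaSet=rs0")
while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    limit = cache["maximum bytes configured"]
    dirty_pct = 100.0 * cache["tracked dirty bytes in the cache"] / limit
    used_pct = 100.0 * cache["bytes currently in the cache"] / limit
    print(f"{time.strftime('%H:%M:%S')} dirty={dirty_pct:.1f}% used={used_pct:.1f}%")
    time.sleep(1)
```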

Comment by Sasa Trifunovic [ 28/Apr/21 ]

Hi Dmitry,


Earlier today we also tried an unordered bulk write (we had been using the default, ordered), which didn't help. Just now, in an effort to rule out replication as the cause of this issue, we tried lowering the write concern from majority to primary only, and that yielded exactly the same result: Mongo got stuck.
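(Roughly what those two variants look like, shown with pymongo only to make them concrete; the namespace and documents are placeholders and the actual application may use a different driver.)

```python
# Illustration of the two variants tried: unordered bulk writes and w:1 write concern.
from pymongo import InsertOne, MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://mongo-0,mongo-1,mongo-2/?replicaSet=rs0")
coll = client["jobs"]["jobRecord"]   # placeholder namespace

# 1) unordered bulk write instead of the default ordered one
coll.bulk_write([InsertOne({"seq": i}) for i in range(1000)], ordered=False)

# 2) write concern lowered from majority to w=1 (primary only)
coll_w1 = coll.with_options(write_concern=WriteConcern(w=1))
coll_w1.bulk_write([InsertOne({"seq": i}) for i in range(1000)], ordered=False)
```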

Comment by Sasa Trifunovic [ 28/Apr/21 ]

Hi Dmitry,

Sorry for the incomplete logs. I have just uploaded two files, diagnostics-April.tar.gz and mongod-logs-end-of-April.tar.gz. The first contains diagnostics for the whole of April, while the second contains updated mongod logs for the period between the 20th and 27th of April. Both archives cover all members of the replica set.

By the way, we performed two additional tests with different profiling levels (1 and 2), in the hope that they might shed more light on this issue.
The first additional test, with the profiling level set to 1, ran between 2021-04-22T13:16 and 2021-04-22T13:35 (UTC); the second, with the profiling level set to 2, ran between 2021-04-22T14:21:36 and 2021-04-22T14:34. Both tests ended with Mongo getting stuck.
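(For reference, switching the profiler level and reading back recent entries looks roughly like this; a pymongo sketch with the database name and slowms value being illustrative:)

```python
# Sketch: set the database profiler level and read back recent profiler entries.
from pymongo import MongoClient

db = MongoClient("mongodb://mongo-0/?replicaSet=rs0")["jobs"]  # placeholder database
db.command("profile", 1, slowms=100)   # level 1: only operations slower than slowms
db.command("profile", 2)               # level 2: profile all operations

for entry in db["system.profile"].find().sort("ts", -1).limit(5):
    print(entry["op"], entry["ns"], entry.get("millis"))
```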

The next day (2021-04-23) we upgraded our primary Mongo instance to m5.2xlarge with gp2 disks. Initially our load test went fine, but on the second run Mongo got stuck again. The failed test started around 15:05 UTC and was aborted around 15:28.

Thanks a lot for looking into this.

Comment by Dmitry Agranat [ 26/Apr/21 ]

Hi sasa.s.trifunovic@gmail.com, the uploaded diagnostic.data only covers April 11-13, while the reported test was performed on April 21st. If you still have the data, could you upload it after checking that it covers the time of the reported test? In addition, it would be helpful to have the diagnostic.data from all three nodes in this replica set.

Comment by Sasa Trifunovic [ 22/Apr/21 ]

Hi,

I have just uploaded the logs (from the primary and its secondaries, spanning a couple of days) and the complete diagnostics folder from the primary.
The test started around Wed Apr 21 15:13:00 UTC 2021. It is hard to determine exactly when it ended, because this time we let it run without aborting it, but I would say it lasted until around Wed Apr 21 16:20:00 UTC 2021.
For this load test we disabled profiling to make sure no overhead was introduced.
If you want, we can repeat the test with profiling turned on at whichever level you find suitable.

Comment by Dmitry Agranat [ 21/Apr/21 ]

Hi sasa.s.trifunovic@gmail.com

Would you please archive (tar or zip) the mongod.log files from all members covering the incident, as well as the $dbpath/diagnostic.data directory (the contents are described here), and upload them to this support uploader location?
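(Any archiving method is fine; for example, with Python's tarfile module, the paths below being placeholders for your actual log location and dbpath:)

```python
# Sketch: archive the mongod log and the diagnostic.data directory for upload.
# Paths are placeholders; adjust them to the actual log location and dbpath.
import tarfile

with tarfile.open("diagnostics.tar.gz", "w:gz") as tar:
    tar.add("/var/log/mongodb/mongod.log")
    tar.add("/data/db/diagnostic.data")
```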

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Please also post the exact start and end timestamps and timezone for the load test. In addition, please note the time when the Database Profiler was turned on.

Dima
