[SERVER-56250] High CPU usage on Mongo server, but Mongo seems to be idle Created: 21/Apr/21 Updated: 27/Oct/23 Resolved: 16/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sasa Trifunovic | Assignee: | Dmitry Agranat |
| Resolution: | Community Answered | Votes: | 0 |
| Labels: | Performance |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Description |
|
We are running MongoDB 4.2.13 as a replica set: one primary and two secondaries. The servers have 4 CPUs and 16 GB of RAM (m5.xlarge on AWS) and are dedicated solely to Mongo. During load tests we regularly run into situations where the primary gets stuck. We process messages from a message queue with 10 threads, and those threads insert the results into Mongo. We couldn't find any hint even with the profiler turned on. We have the necessary indexes, i.e. we are not seeing COLLSCAN in currentOp, and currentOp otherwise didn't reveal anything obviously abnormal. In the attachment you can find instance metrics, Mongo metrics, and mongostat and mongotop output for the problematic period (where the instance is under load but nothing shows up under Mongo). If necessary we can provide more info or perform additional tests and post the results. We are grateful for any advice on how to overcome this issue. |
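For reference, a minimal mongo shell sketch of the kind of currentOp check described above (the 1-second threshold and the printed fields are illustrative choices, not necessarily what the reporter used):

{code:javascript}
// List active operations running longer than 1 second with their plan summary.
// A planSummary of "COLLSCAN" would point to a missing index.
db.currentOp({ active: true, secs_running: { $gt: 1 } }).inprog.forEach(function (op) {
    printjson({ ns: op.ns, op: op.op, secs: op.secs_running, plan: op.planSummary });
});
{code}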
| Comments |
| Comment by Dmitry Agranat [ 05/May/21 ] |
|
Hi sasa.s.trifunovic@gmail.com,
This is precisely the case with this workload: as long as the test keeps dirtying the cache at this rate, eviction keeps running and the average load stays elevated; it only drops once the load test is aborted. |
| Comment by Sasa Trifunovic [ 05/May/21 ] |
|
Hi Dmitry, although I understand that hitting the configured eviction_dirty_trigger (default 20%) with dirty pages will make Mongo try to evict them, and that this results in a much higher average load, I don't understand why that average load never goes down unless we abort the load test. |
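For anyone wanting to experiment with the eviction thresholds discussed here, they can be changed at runtime via wiredTigerEngineRuntimeConfig; a hedged sketch (the values below are purely illustrative, not a recommendation):

{code:javascript}
// Lower the dirty target/trigger and allow more eviction worker threads,
// to test whether eviction can keep up with the write rate.
db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig:
        "eviction=(threads_min=4,threads_max=8),eviction_dirty_target=5,eviction_dirty_trigger=15"
});
{code}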
| Comment by Dmitry Agranat [ 02/May/21 ] |
|
Hi sasa.s.trifunovic@gmail.com, thank you for uploading the requested information. What is causing the rise in CPU utilization is eviction trying to clean up dirty pages, and this is expected. Looking at the workload, you are essentially overwhelming the cache by thrashing bytes in and out of it. For example, given that the dirty cache portion is configured at ~3 GB (on your 15 GB RAM server) when we decide to start evicting data, and that during this workload all of that data has to be evicted every minute, the eviction process kicks in, resulting in the elevated CPU. I am not sure how much data you need to insert/update per second during this test, or what the average page size is, but in general the ways to avoid this situation are to throttle the workload, scale horizontally (sharding), or add more RAM (I am not sure how much, but given that you need to flush ~3 GB during the high CPU event, I would start testing with at least 30 GB and move forward based on the results). Dima |
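The cache figures Dmitry refers to can be read directly from serverStatus; a minimal sketch using the stat names WiredTiger exposes:

{code:javascript}
// Snapshot the cache sizing that determines when dirty eviction kicks in.
// eviction_dirty_trigger defaults to 20% of "maximum bytes configured".
var cache = db.serverStatus().wiredTiger.cache;
printjson({
    maxBytes:   cache["maximum bytes configured"],
    inCache:    cache["bytes currently in the cache"],
    dirtyBytes: cache["tracked dirty bytes in the cache"]
});
{code}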
| Comment by Sasa Trifunovic [ 29/Apr/21 ] |
|
Looking at mongostat during our load tests, it becomes apparent that mongod starts stalling when it hits 20% dirty pages. According to the Mongo docs this is expected, but what seems very strange is that the dirty-page percentage shown in mongostat never goes down afterwards, as if eviction never occurs. |
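One way to check whether eviction is making any progress is to sample the dirty ratio continuously rather than eyeballing mongostat; a rough sketch (the 5-second interval is an arbitrary choice):

{code:javascript}
// Print the dirty-cache percentage every 5 seconds during the load test.
// If eviction keeps up, this should fall back below the 20% trigger.
while (true) {
    var c = db.serverStatus().wiredTiger.cache;
    var pct = 100 * c["tracked dirty bytes in the cache"] / c["maximum bytes configured"];
    print(new Date().toISOString() + "  dirty: " + pct.toFixed(1) + "%");
    sleep(5000);
}
{code}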
| Comment by Sasa Trifunovic [ 28/Apr/21 ] |
|
Hi Dmitry,
Earlier today we also tried unordered bulk writes (we had been using the default, ordered), which didn't help. Just now, in an effort to rule out replication as the cause of this issue, we tried lowering the write concern from majority to w: 1 (primary only), and that yielded exactly the same result: Mongo got stuck. |
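For completeness, a sketch of the two variations described, in mongo shell syntax (the collection name "results" and the "docs" batch are placeholders):

{code:javascript}
// Unordered bulk insert with the write concern lowered from "majority" to w: 1.
db.results.bulkWrite(
    docs.map(function (d) { return { insertOne: { document: d } }; }),
    { ordered: false, writeConcern: { w: 1 } }
);
{code}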
| Comment by Sasa Trifunovic [ 28/Apr/21 ] |
|
Hi Dmitry, sorry for the incomplete logs. I just uploaded two files, diagnostics-April.tar.gz and mongod-logs-end-of-April.tar.gz. The first contains diagnostics for the whole of April, while the second contains updated mongod logs for the period between the 20th and 27th of April. Both files contain logs from all members of the replica set. Btw, we performed two additional tests with different profiling levels (1 and 2), in the hope that they might shed more light on this issue. On the next day (2021-04-23) we bumped our primary Mongo instance to m5.2xlarge with gp2 disks. Initially our load test went OK, but on the second run Mongo got stuck again. The failed test started around 15:05 UTC and was aborted around 15:28. Thanks a lot for looking into this. |
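The profiling levels mentioned are set per database; a minimal sketch (the 50 ms slowms threshold is our own illustrative choice):

{code:javascript}
db.setProfilingLevel(1, { slowms: 50 });                     // level 1: only slow ops
db.setProfilingLevel(2);                                     // level 2: all ops (heavier)
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty(); // inspect recent entries
{code}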
| Comment by Dmitry Agranat [ 26/Apr/21 ] |
|
Hi sasa.s.trifunovic@gmail.com, the uploaded diagnostic.data only covers April 11-13 while the reported test was performed on April 21st. If you still have the data, could you upload it after checking that it covers the time of the reported test? In addition, it would be helpful to have the diagnostic.data from all 3 nodes in this replica set. |
| Comment by Sasa Trifunovic [ 22/Apr/21 ] |
|
Hi, I just uploaded the logs (from the primary and its replicas, spanning a couple of days) and the complete diagnostics folder from the primary. |
| Comment by Dmitry Agranat [ 21/Apr/21 ] |
|
Hi sasa.s.trifunovic@gmail.com, would you please archive (tar or zip) the mongod.log files from all members covering the incident, together with the $dbpath/diagnostic.data directory (the contents are described here), and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Please also post the exact start and end timestamps, and the timezone, of the load test. In addition, please note the time when the Database Profiler was turned on. Dima |