[SERVER-75514] baseline CPU usage and disk reads increased significantly with >6.0.3 Created: 30/Mar/23 Updated: 07/Jul/23 Resolved: 07/Jul/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jordan Baczuk | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
MongoDB v6.0.3 on a t3.micro (x86_64) running Ubuntu 18.04.2 LTS with a gp3 volume for data storage and an ext4 filesystem |
||
| Attachments: |
|
| Assigned Teams: |
Product Performance
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Hi! We are running MongoDB v6.0.3. When the database was upgraded to v6.0.5, the following changes were noted on the server and disk volume (without any changes to the application using the database):
The same changes were observed in v6.0.4, and downgrading to v6.0.3 resolved the issue. Was something changed in v6.0.4 that may have caused this? I'm trying to figure out what I need to change in order to support the newer versions without the huge performance hit.
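For anyone comparing the two versions, the relevant cache numbers can be pulled from serverStatus with a snippet like this (a rough mongosh sketch; the WiredTiger statistic names are assumed from the usual 6.0.x serverStatus layout):

```javascript
// Rough mongosh sketch: sample the WiredTiger cache counters that should
// track the extra disk reads. Statistic names are assumed from the usual
// 6.0.x serverStatus layout; adjust if they differ on your build.
const cache = db.serverStatus().wiredTiger.cache;
printjson({
  version: db.version(),
  bytesReadIntoCache: cache["bytes read into cache"],
  bytesCurrentlyInCache: cache["bytes currently in the cache"],
  maxBytesConfigured: cache["maximum bytes configured"],
});
```
|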
| Comments |
| Comment by Jordan Baczuk [ 07/Jul/23 ] | ||||||||||||||||||
|
Thanks for your help. What I can provide is that we have a collection with 300GB of data with a structure similar to this: | ||||||||||||||||||
| Comment by Chris Kelly [ 07/Jul/23 ] | ||||||||||||||||||
|
We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket. | ||||||||||||||||||
| Comment by Chris Kelly [ 06/Jun/23 ] | ||||||||||||||||||
|
Thanks for your patience and for supplying additional information. However, given what's provided, this still looks like something workload-specific, and we may need your specific case to understand it fully. If you have a test driver that reproduces the issue, we would definitely like to see it. I attempted to verify these claims independently, using updateOne and updateMany with a Node driver against MongoDB 6.0.3 and 6.0.4 with 100,000 pre-inserted documents, and did not observe the cache behavior you describe.
Here, A->B is the attempt to concurrently update around 100,000 documents in a single collection in a 3-node replica set as a series of updateOne commands. I attempted the same with updateMany, with similar results.
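For reference, a reproduction along those lines could look roughly like this (a simplified sketch rather than the exact harness; the connection string, database, and collection names are placeholders):

```javascript
// Simplified sketch of a reproduction along those lines (not the exact harness).
// Assumes the Node.js driver; the connection string and names are placeholders.
const { MongoClient } = require("mongodb");

const uri = "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"; // placeholder
const NUM_DOCS = 100000;

async function main() {
  const client = new MongoClient(uri);
  await client.connect();
  const coll = client.db("test").collection("docs");

  // Pre-insert the documents once, then repeatedly update the same ones.
  if ((await coll.estimatedDocumentCount()) < NUM_DOCS) {
    const docs = Array.from({ length: NUM_DOCS }, (_, i) => ({ _id: i, counter: 0 }));
    await coll.insertMany(docs, { ordered: false });
  }

  // Fire updateOne commands with limited concurrency.
  const CONCURRENCY = 100;
  for (let start = 0; start < NUM_DOCS; start += CONCURRENCY) {
    const batch = [];
    for (let i = start; i < Math.min(start + CONCURRENCY, NUM_DOCS); i++) {
      batch.push(coll.updateOne({ _id: i }, { $inc: { counter: 1 } }));
    }
    await Promise.all(batch);
  }

  await client.close();
}

main().catch(console.error);
```
| ||||||||||||||||||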
| Comment by Ben Figlin [ 18/May/23 ] | ||||||||||||||||||
|
I also collected stack traces for 6.0.4 when the CPU was in the iowait state versus when it was idle. The stack traces look nearly identical except for ONE thread, which, as I expected, is the checkpointer. 6.0.4 right after running the query, while in the high CPU iowait state:
6.0.4 right after running the query, after the high CPU iowait state ends and the CPU is back to idle:
| ||||||||||||||||||
| Comment by Ben Figlin [ 18/May/23 ] | ||||||||||||||||||
|
Hi,
I moved the DB data to an XFS filesystem, just in case: exact same behavior as before. I compared 6.0.3 to 6.0.4 so that there is the smallest possible set of changes between the two versions. 6.0.3 has no issues; 6.0.4 has issues.
I enabled log verbosity as mentioned above; the difference in the log files between the two versions is this line, which shows up only on 6.0.4:
It happens twice and is consistent with two ~30-second periods of high CPU in iowait. During those iowait periods the RAM usage increases and there are a lot of disk reads.
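For reference, the verbosity bump and the Checkpointer filtering can be done from mongosh with something like this (a rough sketch; bumping the storage component to level 1 is my assumption of what is needed, and getLog only returns the recent in-memory buffer, so the full mongod.log still has to be grepped for the complete history):

```javascript
// Rough mongosh sketch (assumption: storage component at verbosity 1 is
// enough to surface the checkpoint messages; getLog only returns the most
// recent in-memory log lines, so the mongod.log file has the full history).
db.setLogLevel(1, "storage");

// Pull the recent log buffer and keep only Checkpointer entries.
const { log } = db.adminCommand({ getLog: "global" });
log.filter((line) => line.includes("Checkpointer")).forEach((line) => print(line));
```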
Two logs (filtered by "Checkpointer"). 6.0.3:
6.0.4:
CPU usage (notice the iowait, in purple, which aligns with the checkpoint saving). This is the same operation, which ran and completed in 45 seconds in both cases. 6.0.3:
6.0.4:
For another run, I attached the full log from right after the query itself finishes (where the high CPU iowait happens on 6.0.4), with heap profiling enabled as well; see the two log files attached to the ticket. These lines seem to show up on 6.0.4 but not on 6.0.3:
I think these graphs (which were uploaded here by the MongoDB team) show a very clear difference in caching, or something else related to memory usage, between <=6.0.3 and 6.0.4: | ||||||||||||||||||
| Comment by Chris Kelly [ 10/May/23 ] | ||||||||||||||||||
|
Thanks for clarifying! This is definitely interesting; some additional data may help us uncover what's going on. Some comments:
| ||||||||||||||||||
| Comment by Ben Figlin [ 10/May/23 ] | ||||||||||||||||||
|
The credits are not fully utilized during the tests. Also, we can go back and forth between 6.0.3 and 6.0.5 and clearly see the performance hit on 6.0.5. We tried it multiple times, sometimes with 6.0.5 first, sometimes with 6.0.3 first.
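For what it's worth, the credit balance around a test window can be pulled from CloudWatch with something like this (a sketch using the AWS SDK for JavaScript v3; the region and instance ID are placeholders):

```javascript
// Sketch: pull the instance's CPUCreditBalance from CloudWatch for the last
// hour (AWS SDK for JavaScript v3; region and instance ID are placeholders).
const {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} = require("@aws-sdk/client-cloudwatch");

async function creditBalance() {
  const cw = new CloudWatchClient({ region: "us-east-1" }); // placeholder region
  const now = new Date();
  const res = await cw.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/EC2",
      MetricName: "CPUCreditBalance",
      Dimensions: [{ Name: "InstanceId", Value: "i-0123456789abcdef0" }], // placeholder
      StartTime: new Date(now.getTime() - 60 * 60 * 1000), // last hour
      EndTime: now,
      Period: 300,
      Statistics: ["Average"],
    })
  );
  console.log(res.Datapoints);
}

creditBalance().catch(console.error);
```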
The performance analysis screenshots that ahmad.shah@mongodb.com uploaded show a clear and large difference between the two versions, specifically in the read ops and the iowait CPU state. I understand that this is likely due to a bottleneck on the storage, but this bottleneck does not exist on <=6.0.3, so I guess something in the way 6.0.5 handles storage access is different. Just a reminder, per the original ticket description: the issue started with 6.0.4 and continued with 6.0.5.
Thanks, Ben
| ||||||||||||||||||
| Comment by Chris Kelly [ 10/May/23 ] | ||||||||||||||||||
|
After reviewing your most recent data, we suspect this issue may be related to the use of the t3.micro instance for this test. Specifically, if you are running the test on 6.0.3 first, you may be exhausting your burst credits prior to the 6.0.5 run. Before going further, can you please clarify:
Chris | ||||||||||||||||||
| Comment by Ben Figlin [ 10/Apr/23 ] | ||||||||||||||||||
|
Hi chris.kelly@mongodb.com, I'm jordan.baczuk@qnergy.com's teammate. I created a test Node.js script that updates 100k documents in a collection, always the same documents for consistency. The script runs on a different server from the mongod instance itself, to isolate resources used by Node.js from resources used by mongod. Nothing else is using the DB at the time of the tests.
Please see the attached image (the charts were taken from netdata). It is clear that with 6.0.5 there is a strange, huge spike in IOPS on the DB data storage disk which lasts a relatively long time after the script has ended and exited; in addition, there is another strange spike that comes later and only happens on 6.0.5. The CPU usage in all charts comes from the mongod service. With 6.0.3, the CPU and disk usage look normal, and the CPU goes down to ~0% immediately after the script ends.
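To quantify the post-script disk activity from the server side, a small sampling loop like this could be left running in mongosh on the mongod host (a rough sketch; the statistic name is assumed from the usual serverStatus layout):

```javascript
// Rough mongosh sketch: sample how many bytes WiredTiger reads into cache
// every 5 seconds, so the post-script read spike can also be seen from the
// server side (statistic name assumed from the usual serverStatus layout).
let prev = db.serverStatus().wiredTiger.cache["bytes read into cache"];
for (let i = 0; i < 60; i++) {
  sleep(5000); // mongosh built-in, milliseconds
  const cur = db.serverStatus().wiredTiger.cache["bytes read into cache"];
  print(`${new Date().toISOString()} read into cache: ${((cur - prev) / (1024 * 1024)).toFixed(1)} MiB`);
  prev = cur;
}
```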
I also sent the diagnostic and log files to the secure upload link you shared above (file name 6.0.3-vs-6.0.5_same-operation.zip).
| ||||||||||||||||||
| Comment by Jordan Baczuk [ 10/Apr/23 ] | ||||||||||||||||||
|
chris.kelly@mongodb.com Ok, I wasn't sure how to interpret the charts you posted. I assume that both versions of the database are charted together, with the older version first. I can try to set that test environment up again when I have time. It is quite strange that the CPU usage is lower for the latest version there, while on the EC2 instance you can see it is noticeably higher. AFAIK, there were no other applications running on that instance, only MongoDB (apart from what comes installed by default), so I'm not sure where to look next. In the meantime, is there a tool that I can use to view the diagnostic data on my end? | ||||||||||||||||||
| Comment by Chris Kelly [ 07/Apr/23 ] | ||||||||||||||||||
|
No problem, jordan.baczuk@qnergy.com! To address your concerns:
I would confirm that:
Christopher
| ||||||||||||||||||
| Comment by Jordan Baczuk [ 06/Apr/23 ] | ||||||||||||||||||
|
chris.kelly@mongodb.com Thanks for taking a look. The workload should have been identical between the two tests. Just some thoughts... It's possible a mistake was made when running the 6.0.5 test, but perhaps the 6.0.5 logs aren't showing the inserts because none of them exceeded the slow query threshold of 100 ms? Also, if there really were no inserts, shouldn't the increased CPU usage be an even bigger red flag?
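If it helps rule that out, I believe the slow-op threshold can be dropped for a run so every operation gets logged regardless of duration, roughly like this (a sketch; level 0 keeps the profiler off and only changes the logging threshold):

```javascript
// Sketch: keep the profiler off (level 0) but log every operation by dropping
// the slow-op threshold to 0 ms for the next run, then restore the default.
db.setProfilingLevel(0, { slowms: 0 });
// ... run the test ...
db.setProfilingLevel(0, { slowms: 100 }); // restore the default threshold
```
| ||||||||||||||||||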
| Comment by Chris Kelly [ 06/Apr/23 ] | ||||||||||||||||||
|
After reviewing your diagnostic data, it appears the workloads between your two runs are not identical. For example, your 6.0.5 instance didn't perform any inserts, whereas 6.0.3 did. This implies significant differences in the workload, which makes it difficult to draw conclusions. If possible, please supply a test driver that reproduces your observations clearly and identically between the two instances.
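As a lightweight sanity check that the two runs really are identical, the opcounters could be snapshotted before and after each run and diffed, for example (a rough mongosh sketch):

```javascript
// Sketch: snapshot opcounters before and after a test run; the deltas should
// match between the 6.0.3 and 6.0.5 runs if the workload really is identical.
const before = db.serverStatus().opcounters;
// ... run the test driver ...
const after = db.serverStatus().opcounters;
const delta = {};
for (const op of ["insert", "query", "update", "delete", "getmore", "command"]) {
  delta[op] = after[op] - before[op];
}
printjson(delta);
```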
| ||||||||||||||||||
| Comment by Jordan Baczuk [ 30/Mar/23 ] | ||||||||||||||||||
|
That info has been uploaded, thanks. | ||||||||||||||||||
| Comment by Yuan Fang [ 30/Mar/23 ] | ||||||||||||||||||
|
Thank you for your report. I understand that you have observed an increase in CPU usage and disk reads after upgrading to version >=v6.0.4. I will bring this to the attention of our team. In the meantime, could you please provide diagnostic data that covers the event of downgrading to v6.0.3, or upgrading to v6.0.4/v6.0.5, so that we can compare the two versions (i.e. v6.0.3 and >=v6.0.4)? I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time. For each node in the replica set, and for a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
Regards, | ||||||||||||||||||
| Comment by Jordan Baczuk [ 30/Mar/23 ] | ||||||||||||||||||
|
For more info: a couple of Node.js applications use the database. Their driver versions are 4.13.0 and 4.9.1. I tested each application individually, and each had the same effect on performance as running both at the same time. In other words, running only one of the applications did not solve the issue. |