[SERVER-44948] High cache pressure in 4.2 replica set vs 4.0  Created: 04/Dec/19  Updated: 29/Oct/23  Resolved: 28/Jan/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Smith | Assignee: | Dmitry Agranat |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
I have a 3-node replica set running version 4.2.1 on Ubuntu 18.04; previously we had been on 4.0.4. The cluster is significantly slower than it was on 4.0.4 in three ways: bulk insert speed on the primary, the speed of other operations while those insert jobs are running, and the replication lag we see on the secondaries. I think that, at least on the primary, this has something to do with elevated cache pressure, but I'm mostly guessing. I'm happy to provide diagnostic files and logs privately. We're now running with featureCompatibilityVersion 4.2, flow control off, and enableMajorityReadConcern false (a sketch of these settings appears at the end of this description), and have seen some improvement, but we're still in trouble with cache pressure and insert speed, and replication lag is quite high when we're running jobs that insert millions of records. Some performance numbers, each with upwards of 15-25m rows:
A couple of notes on our setup:
|
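For reference, a minimal sketch of the settings the description mentions (featureCompatibilityVersion 4.2, flow control off, enableMajorityReadConcern false). The host name is a placeholder and this is an illustration, not the reporter's exact steps:

    # Sketch only; primary.example.net is a placeholder.
    # featureCompatibilityVersion and flow control are runtime admin commands
    # issued against the primary.
    mongo --host primary.example.net:27017 --eval '
      db.adminCommand({ setFeatureCompatibilityVersion: "4.2" });
      db.adminCommand({ setParameter: 1, enableFlowControl: false });
    '
    # enableMajorityReadConcern is a startup option. In mongod.conf on each member:
    #   replication:
    #     enableMajorityReadConcern: false
    # It takes effect after that member is restarted.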
| Comments |
| Comment by Dmitry Agranat [ 28/Jan/20 ] | ||
|
Hi mzs, MongoDB 4.2.3 was just released with the fix for the linked ticket. Regards, Dmitry
| Comment by Dmitry Agranat [ 18/Dec/19 ] | ||
|
Hi mzs, I am checking this information and will update you on the progress.
| Comment by Michael Smith [ 17/Dec/19 ] | ||
|
Hi Dmitry, It looks like there's been some progress on the linked ticket. Would you be able to offer any guidance on when the fix would make it into a 4.2.x patch release? (for example: 2 weeks, 1 month, 2 months...) That will help me come up with a mitigation strategy until then. Thanks, Mike
| Comment by Michael Smith [ 09/Dec/19 ] | ||
|
Thanks! I'll keep an eye on the linked ticket.
| Comment by Dmitry Agranat [ 09/Dec/19 ] | ||
|
Thanks mzs, with the latest information you've provided, we suspect this is related to the linked ticket.
| Comment by Michael Smith [ 09/Dec/19 ] | ||
|
I've uploaded the requested perf captures along with a fresh diagnostic.data archive (this was with enableFlowControl: false and enableMajorityReadConcern: false).
| Comment by Michael Smith [ 08/Dec/19 ] | ||
|
Hi Dima, I've uploaded an initial set of perf captures from a dev run. I'll run a larger job in prod overnight, but figured I'd send these for now in case they're sufficient. I don't think I'm able to downgrade a cluster to 4.0.x, as we've created a number of new indexes since the upgrade. Hopefully the regression will be large enough to spot in either the dev or prod perf traces from 4.2.1. Thanks for taking a look at this. Mike
| Comment by Dmitry Agranat [ 08/Dec/19 ] | ||
|
Hi mzs, Thank you for uploading all the requested information; it was very useful. We currently suspect a certain regression and will need to collect some additional information. In order for us to understand the difference in the reported CPU utilization, we'd like to collect some perf call stacks. This will require installing the Linux perf tool. A before/after comparison would be best, but the suspected regression may be large enough that data for 4.2.1 alone would be useful. Execute this during the insert workload. This command will capture call stack samples in separate files of 60 seconds each:
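As a rough sketch of a capture matching that description (the sample rate, number of chunks, and file names are assumptions, not the original script):

    # Assumed sketch, not the exact script from the original comment.
    # Capture system-wide call stacks with call graphs while the insert
    # workload runs, one 60-second chunk per output file.
    for n in $(seq 1 10); do
        sudo perf record -a -g -F 99 -o perf.data.$n -- sleep 60
    done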
Then run perf script on the subset of files of interest, for example:
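A minimal sketch of that step, assuming the perf.data.<n> file names from the capture sketch above:

    # Convert each binary capture into a symbolized text file for upload.
    # This must run on the same node that produced the perf.data files.
    for f in perf.data.*; do
        sudo perf script -i "$f" > "$f.txt"
    done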
Once completed, please upload all of the generated .txt files and a fresh archive of the diagnostic.data. Note that it is important to run perf script on the same node where perf.data was generated so that it can be correctly symbolized using the addresses on that machine. Thanks, Dmitry
| Comment by Michael Smith [ 06/Dec/19 ] | ||
|
I think we're in luck: I was able to find diagnostic data and logs from a few trials, and I've uploaded them.
| Comment by Dmitry Agranat [ 05/Dec/19 ] | ||
|
Hi mzs, Thanks for the report. We'd like to compare 4.0.4 and 4.2.1 workloads under default configuration (your example #4). Is it possible to provide these two separate sets of diagnostic.data? I am asking because diagnostic.data retention under heavy load (which you've described) might only hold about a week of data. If you still have both 4.0.4 and 4.2.1 data for such a comparison, please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from all members of the replica set and upload them to this support uploader location. Please also note the exact time and timezone of each workload you'd like us to compare. If the data from 4.0.4 is no longer available, we might need to consider another approach. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, Dmitry
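A minimal sketch of the kind of archive being requested, assuming a default dbpath of /var/lib/mongodb and logs under /var/log/mongodb; adjust the paths to the actual deployment and repeat on every member of the replica set before uploading:

    # Assumed paths; substitute the real dbpath and log locations.
    tar -czf "$(hostname)-mongodb-diagnostics.tar.gz" \
        /var/log/mongodb/mongod.log* \
        /var/lib/mongodb/diagnostic.data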