[SERVER-44948] High cache pressure in 4.2 replica set vs 4.0 Created: 04/Dec/19  Updated: 29/Oct/23  Resolved: 28/Jan/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Smith Assignee: Dmitry Agranat
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: SERVER-44991.svg, wt_row_leaf_key_work.png
Issue Links:
Related
related to SERVER-44991 Performance regression in indexes wit... Closed
is related to SERVER-44881 Giant slow oplog entries are being lo... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

I have a 3-node replica set running version 4.2.1 on Ubuntu 18.04. Previously we had been on 4.0.4. I'm noticing that the cluster is significantly slower than it was on 4.0.4: bulk insert speed on the primary is lower, other operations slow down while those insert jobs are running, and replication lag on the secondaries is higher.

I think at least on the primary this has something to do with elevated cache pressure, but I'm kind of just guessing. I'm happy to provide diagnostic files and logs privately.

We're now running with featureCompatibilityVersion 4.2, flow control off, and enableMajorityReadConcern set to false, and have seen some improvement, but we're still struggling with cache pressure and insert speed, and replication lag is quite high when we run jobs that insert millions of records.
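
For reference, here is a rough sketch of how these settings are typically applied on 4.2 (hosts and auth options omitted; the mongod.conf snippet is shown as a comment):

# featureCompatibilityVersion and flow control can be changed at runtime:
mongo admin --eval 'db.adminCommand({ setFeatureCompatibilityVersion: "4.2" })'
mongo admin --eval 'db.adminCommand({ setParameter: 1, enableFlowControl: false })'
# enableMajorityReadConcern is a startup-only option, set in mongod.conf before restarting:
#   replication:
#     enableMajorityReadConcern: false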

Some performance numbers, each from a job of roughly 15-25 million rows:

  1. With all nodes on 4.0.4: a job averaged 875 rows/sec. The local secondary mostly kept up (maybe 10-60 seconds lag). The geographically remote secondary was never more than 300 seconds behind.
  2. With the secondaries on 4.2.1 and the primary on 4.0.4: a job averaged 484 rows/sec. Both secondaries were about 5000-10000 seconds behind by the end of the job.
  3. On a similar but standalone node on 4.2.1, the job averaged 4012 rows/sec.
  4. With all nodes on 4.2.1 and default settings for flowControl and enableMajorityReadConcern: average 265 rows/sec and all other operations were significantly impacted.
  5. After disabling flowControl: average 712 rows/sec, but both secondaries were about 5000 seconds lagged and all other operations were significantly delayed; for example, I saw a document update take 62673 ms with timeWaitingMicros: { cache: 62673547 }
  6. After disabling flowControl and setting enableMajorityReadConcern to false: no specific insert number, but I'm seeing "tracked dirty bytes in the cache" at about 1.5 GB while an insert job is running. One secondary is lagged by 4 sec and the other by 1183 sec. (A sketch of how to read these figures follows this list.)
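
A rough sketch of how figures like these can be checked from the shell on a 4.2 node (the exact invocations are illustrative):

# WiredTiger dirty-cache figure (the "tracked dirty bytes in the cache" statistic above):
mongo --quiet --eval 'print(db.serverStatus().wiredTiger.cache["tracked dirty bytes in the cache"])'
# Per-secondary replication lag (4.2 shell helper):
mongo --quiet --eval 'rs.printSlaveReplicationInfo()'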

A couple of notes on our setup:

  • Our bulk inserts are of ~2.5 KB documents into a collection with a large number of indexes (20, one of which is multi-key, on an array of strings); a hypothetical sketch of this shape follows below.
  • We do have a couple of change stream listeners (not on the same collection the bulk writes target), so I suppose some level of majority read concern is still being supported.
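
To make that workload shape concrete, here is a purely hypothetical sketch (database, collection, and field names are invented; sizes are approximate):

# Illustration only: ~2.5 KB documents with a string-array field backed by a multi-key index.
mongo mydb --eval '
  var docs = [];
  for (var i = 0; i < 1000; i++) {
    docs.push({ payload: "x".repeat(2500), tags: ["alpha", "beta", "gamma"] });
  }
  db.mycoll.insertMany(docs, { ordered: false });
'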


 Comments   
Comment by Dmitry Agranat [ 28/Jan/20 ]

Hi mzs,

MongoDB 4.2.3 was just released with the fix for SERVER-44991. I will go ahead and close this ticket but do not hesitate to reopen if this issue reoccurs.

Regards,
Dima

Comment by Dmitry Agranat [ 18/Dec/19 ]

Hi mzs, I am checking this information and will update you on the progress.

Comment by Michael Smith [ 17/Dec/19 ]

Hi Dmitry,

It looks like there's been some progress in SERVER-44991 and a possible fix has been committed.

Would you be able to offer any guidance on when it would make it into a 4.2.x patch release? (for example: 2 weeks, 1 month, 2 months...) That will help me come up with a mitigation strategy until then.

Thanks,

Mike

Comment by Michael Smith [ 09/Dec/19 ]

Thanks! I'll keep an eye on SERVER-44991. Much appreciated!

Comment by Dmitry Agranat [ 09/Dec/19 ]

Thanks mzs, with the latest information you've provided, we are suspecting this is related to SERVER-44991. We are still investigating SERVER-44991 and I will keep you updated on the progress.

Comment by Michael Smith [ 09/Dec/19 ]

I've uploaded SERVER-44948-perf-prod.zip with more interesting data.

  • ~22:52:45 - bulk insert started
  • by 00:02:25 one secondary was 600+ seconds lagged, the other about 1000, and small writes to other collections were taking 1-3 seconds with things like timeWaitingMicros: { cache: 1961686 }. Not as bad as I've seen it, but I figured it'd be representative.

(this was with enableFlowControl: false and enableMajorityReadConcern: false)

Comment by Michael Smith [ 08/Dec/19 ]

Hi Dima,

I've uploaded SERVER-44948-perf-dev.zip with the perf script output, diagnostic data, and syslog from an insert job in our development environment covering 2019-12-08 17:42:17-17:59:19 with 4.2.1. It may not be interesting - the test file was probably too simple to really slow things down. There was some cache pressure briefly at 2019-12-08 17:50 UTC. The secondaries never got more than about 5 seconds out of sync, even with flow control disabled.

I'll run a larger job in prod overnight, but figured I'd send these for now in case they're sufficient.

I don't think I'm able to downgrade a cluster to 4.0.x as we've created a number of new indexes since the upgrade. Hopefully something in the dev or prod perf traces from 4.2.1 will be large enough to spot.

Thanks for taking a look at this.

Mike

Comment by Dmitry Agranat [ 08/Dec/19 ]

Hi mzs,

Thank you for uploading all the requested information; it was very useful. We currently suspect a specific regression and will need to collect some additional information.

In order for us to understand the difference in the reported CPU utilization, we'd like to collect some perf call stacks. This will require installing the Linux perf tool.

A before/after comparison would be best, but the suspected regression may be large enough that data for 4.2.1 alone would be useful.

Execute the following during the insert workload. This command captures call stack samples in separate files of 60 seconds each:

while true; do perf record -a -g -F 99 -o perf.data.$(date -u +%FT%TZ) sleep 60; done

Then run perf script on the subset of files of interest:

for fn in ...; do perf script -i $fn >$fn.txt; done
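
For example (the date glob here is hypothetical; substitute the capture files you want to symbolize):

for fn in perf.data.2019-12-08T*Z; do perf script -i "$fn" > "$fn.txt"; done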

Once completed, please upload all of the generated .txt files and a fresh archive of the diagnostic.data.

Note that it is important to run perf script on the same node where perf.data was generated so that it can be correctly symbolized using the addresses on that machine.

Thanks,
Dima

Comment by Michael Smith [ 06/Dec/19 ]

I think we're in luck: I was able to find diagnostic data and logs from a few trials. I've uploaded SERVER-44948.zip for you.

  1. 2019-11-28 17:32:10 - 19:45:51 UTC: all members 4.0.4, ~1m doc insert job, no significant impact on other operations, no significant lag. (I'd downgraded the secondaries from 4.2.1 back to 4.0.4 before this.) See syslog.7 for mongod logs.
  2. 2019-12-01 20:16:47 - 2019-12-02 13:31:33 UTC: all members 4.2.1, fcv 4.2, flowControl on (default), majority read concern enabled (default), ~16m doc insert job, other operations significantly delayed. For example, at 2019-12-02 10:07:34 in syslog.3 I can see a single-document update in a small collection (unrelated to the bulk insert job) took 31032 ms.
  3. 2019-12-05 22:35:00 UTC until now: all members 4.2.1, fcv 4.2, flowControl off, majority read concern disabled, ~5m doc insert job, significant replication lag, other operations significantly delayed. See syslog for mongod logs.

Comment by Dmitry Agranat [ 05/Dec/19 ]

Hi mzs,

Thanks for the report. We'd like to compare the 4.0.4 and 4.2.1 workloads under the default configuration (your example #4). Is it possible to provide these two separate sets of diagnostic.data? I am asking because diagnostic.data retention under heavy load (which you've described) might only hold about a week of data.

If you still have both 4.0.4 and 4.2.1 data for such comparison, please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from all members of the replica set and upload them to this support uploader location.
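
For example, a rough sketch assuming the default Ubuntu package paths for the log files and dbpath (adjust to your configuration), run on each member:

tar czf SERVER-44948-$(hostname).tar.gz /var/log/mongodb/mongod.log* /var/lib/mongodb/diagnostic.data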

Please also note the exact time and timezone of each workload you'd like us to compare.

If the data from 4.0.4 is no longer available, we might need to consider another approach.

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima
