[SERVER-68245] Degraded performance after 4.4 version upgrade Created: 23/Jul/22  Updated: 18/Oct/22  Resolved: 18/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.15
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ishrat Jahan Assignee: Chris Kelly
Resolution: Done Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File blocks read vs vblocks written-1.png     PNG File blocks read vs vblocks written.png     PNG File blocks read.png     PNG File cache pages read and written to disk.png     PNG File cahce eviction.png     PNG File dirty cache.png     File metrics.2022-07-22T00-20-13Z-00000     File metrics.2022-07-22T05-45-13Z-00000    
Operating System: ALL
Participants:

 Description   

Hi,

We have a replica set with 3 nodes. We were earlier on version 4.2.
Average Response time: 1.5 ms
p55: 5ms

The version upgrade to 4.4 was done around 21st July 10:00 pm. Immediately after the upgrade we did not notice any issues, but after a few hours we started seeing an increase in response time and a lot of slow queries on the mongo side.
Average Response time: 3.5 ms
p55: 15ms

Data size: 5TB(uncompressed)
Index size: 35GB

We also observed the following:

  • Increase in memory page faults
  • Dirty data and dirty pages in cache
  • Increase in blocks read
  • Decrease in blocks written
  • Increase in cache evictions
  • Increase in cache pages read
  • Decrease in cache pages written

Please find attached the screenshots and the diagnostic data.

Please help us in finding out the issue and suggest an appropriate fix for this.
Thanks in advance!



 Comments   
Comment by Chris Kelly [ 18/Oct/22 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Chris Kelly [ 04/Oct/22 ]

Tejas,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please supply the requested information? Specifically, it is difficult to ascertain what degradation you're seeing without an example:

  • A workload that is known to be identical, such as a test workload that can be repeated
  • Diagnostic data and mongodb logs for comparative tests of that repeatable workload on both 4.2 and 4.4

Christopher

Comment by Chris Kelly [ 12/Sep/22 ]

Hi Tejas,

Thank you for your patience on this. In order to investigate this further, it is imperative we get something that is reproducible. If you are unable to downgrade to 4.2 to collect diagnostic data for comparison, it is at least necessary to provide a workload that is known to cause this problem, such as a test workload that can be repeated on 4.2 and 4.4.

As Eric mentioned, because of the differences in 4.2 and 4.4, asking our community for help by posting on the MongoDB Developer Community Forums may prove helpful for tuning your performance.

Regards,
Christopher

Comment by Tejas Jadhav [ 18/Aug/22 ]

Hello Eric,

We are not able to repeat this on another workload, since this is a production instance and it would be hard to replicate the traffic as well as the setup. Nonetheless, we are still seeing the response time spikes every day. We also won't be able to provide any diagnostic data for 4.2, since that would require a downgrade. Is there anything else we can provide on the current setup?

Also, all times in the above screenshots are in IST (+05:30).
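
For reference, a minimal sketch of how the UTC window of the diagnostic data quoted in the 12/Aug/22 comment (~2022-07-22T00:20:14.000Z to ~2022-07-22T07:54:00.008Z) maps onto IST. It uses only the Python standard library; the helper name `utc_to_ist` is illustrative and is not part of any MongoDB tooling.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed offset of +05:30 from UTC (no daylight saving).
IST = timezone(timedelta(hours=5, minutes=30), "IST")


def utc_to_ist(stamp: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp ending in 'Z' and return it in IST."""
    utc_time = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return utc_time.astimezone(IST)


# Window boundaries as quoted in the 12/Aug/22 comment.
window_start = utc_to_ist("2022-07-22T00:20:14.000Z")
window_end = utc_to_ist("2022-07-22T07:54:00.008Z")

print("diagnostic data starts (IST):", window_start.strftime("%Y-%m-%d %H:%M:%S"))
print("diagnostic data ends   (IST):", window_end.strftime("%Y-%m-%d %H:%M:%S"))
```

Under these assumptions the window runs from roughly 05:50 to 13:24 IST on 22 July, which is the range to compare against the time axis of the attached screenshots.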

Comment by Eric Sedor [ 12/Aug/22 ]

Hi ishrat.jahan@gojek.com, and apologies for the delay addressing this report. I want to start by setting some expectations:

  • There are some underlying storage engine changes between 4.2 and 4.4 that could be impacting this workload, but the magnitude of the changes will make a direct comparison difficult
  • In looking at the provided diagnostic data, there are a few concerns I have:
    • The time period of this data is ~2022-07-22T00:20:14.000Z to ~2022-07-22T07:54:00.008Z - If the timestamps in your provided screenshots are in UTC, that seems to coincide with the lowest load period
    • Within this range, the number of active writers and average latency maximums of ~20ms are very low in real terms. Such small ranges of performance variation can be very difficult to attribute given the 1 second sample rate of diagnostic data

To have the best chance of implicating a cause of a performance regression like this, we would really need to see:

  • A workload that is known to be identical, such as a test workload that can be repeated
  • Diagnostic data and mongodb logs for comparative tests of that repeatable workload on both 4.2 and 4.4

Are you able to provide this?
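
As a rough illustration of what a repeatable comparison could report, here is a minimal sketch that times a fixed operation and summarizes the mean and p99 latency, so the same script can be run against a 4.2 and a 4.4 deployment. It is an assumption-laden outline, not a MongoDB tool: `operation` is a hypothetical placeholder for whatever query the real workload issues, and the nearest-rank percentile is just one common definition.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_workload(operation: Callable[[], object], iterations: int) -> List[float]:
    """Run `operation` a fixed number of times and return per-call latency in ms."""
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()  # hypothetical placeholder for one query of the real workload
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Reduce a latency sample to the two figures discussed in this ticket."""
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Stand-in operation so the sketch runs without a database.
    sample = time_workload(lambda: sum(range(1000)), iterations=200)
    print(summarize(sample))
```

Running the identical script, with the same data set and iteration count, against each version gives two summaries that can be compared directly alongside the diagnostic data.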

Finally, and because of the differences in 4.2 and 4.4, I might also suggest treating this as a performance tuning effort on 4.4 and asking our community for help by posting on the MongoDB Developer Community Forums. They might be able to help zero in on specific improvements to make.

Sincerely,
Eric

Comment by Ishrat Jahan [ 01/Aug/22 ]

Our p99 response time has spiked to 22 ms. Can someone please help check this?

Generated at Thu Feb 08 06:10:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.