[SERVER-28405] Performance degradation 3.4.2 vs 3.2.11 Created: 21/Mar/17  Updated: 27/Oct/23  Resolved: 03/Apr/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Luke Morfitt Assignee: Bruce Lucas (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2017-03-21 at 09.49.22.png     PNG File event.png    
Operating System: ALL
Participants:

 Description   

Hello,

We have a rather large mongo replica set containing 19,000 databases. We have an outstanding issue with the WiredTiger cache as per https://jira.mongodb.org/browse/SERVER-27700 and https://jira.mongodb.org/browse/SERVER-27753.

As we are expecting the change to be in the next 3.4.x-rc, we upgraded our cluster from 3.2.11 to 3.4.2 to understand the current performance characteristics, hoping that we would gain some improvement.

Since the upgrade (less than 24 hours ago) our jobs (which use this cluster) have been taking many hours to finish. In fact, some of them have not finished running and we have canceled them.

The following is a screenshot of job run times, showing how mongo 3.2.11 / 3.4.2 impacts the run time.

Please can you provide an upload link and I will provide the diagnostic data for the cluster.

Kind regards,
Luke



 Comments   
Comment by Luke Morfitt [ 30/Mar/17 ]

Hi,

Sorry for the delay getting back to you.

As 3.4.3 has been released I would like to close this ticket and retest. 3.4.3 has the performance fix we have been waiting for.

Thank you for your continued support,
Luke

Comment by Bruce Lucas (Inactive) [ 22/Mar/17 ]

Hi Luke,

I'd like to add another question: your chart shows a restart under 3.2.11 at 20.03 00:00, but I'm not seeing that in the logs or diagnostic data you uploaded. Can you clarify what node was restarted?

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 22/Mar/17 ]

Hi Luke,

Thanks for uploading the data, we're taking a look.

I'd like to correlate the events in your chart with our data. Can you clarify:

  • What is the timezone of the times shown on your chart?
  • I think each of the triangles on your chart is a single job, and the line represents cumulative time for that job, so the height represents the total time for the job. Is that correct?
  • The jobs that started prior to about 20:45 under 3.4.2 finished in 30-60 minutes, which is similar to the time you are seeing under 3.2.11 with a full cache, whereas you are looking for times under 30 minutes, like you see after a restart under 3.2.11, am I understanding correctly?
  • The job that started at about 20:45 took many hours, and that cannot be explained by the job doing more work, correct?

Regarding the jobs, can you clarify:

  • How much variability is inherent in the size of the jobs? I understand that variability in the jobs does not account for the slowness you are seeing, but I would like to try to correlate job run time with the internal state in mongod, and to do that I need to know how much variability is inherent in the jobs themselves.
  • Do the jobs do any work on the secondaries, or do they depend only on the primary? This dependency could include both direct dependency, like secondary queries, or indirect dependency, like writes with majority write concern. We noticed that there are some secondary queries and also some secondary lag, and want to understand what impact secondary performance might have on the job runtime issue you are seeing.

Thanks,
Bruce

Comment by Luke Morfitt [ 21/Mar/17 ]

Hi anonymous.user,

Thank you. I've uploaded the data from all the nodes in the replica set.

Kind regards,
Luke

Comment by Kelsey Schubert [ 21/Mar/17 ]

Hi lukemorfitt,

I've created a secure portal. Would you please upload the complete log files as well as the diagnostic.data?

Thank you,
Thomas
