[SERVER-49223] Suggestions to speed up initial sync Created: 01/Jul/20  Updated: 28/Jul/20  Resolved: 28/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.8
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Oliver Yeh Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

I have a 5TB instance with 2 collections that I need to move from zlib to zstd compression.  The initial sync is painfully slow right now, even on a very beefy secondary instance (m5.12xlarge, 192GB of RAM, 48 vCPUs).  At the current rate, the initial sync is projected to take 8 days to complete.
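To put the 8-day estimate in perspective, the average throughput it implies can be worked out with simple arithmetic. The sketch below assumes "5TB" means 5 x 10^12 bytes and that the sync runs at a constant rate; the ticket does not say whether 5TB is the compressed or uncompressed size, so the result is only a rough figure.

```python
def implied_throughput_mb_per_s(total_bytes: float, days: float) -> float:
    """Average transfer rate (decimal MB/s) needed to move total_bytes in the given days."""
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 1e6

# Assumption: 5 TB = 5e12 bytes, copied evenly over the projected 8 days.
rate = implied_throughput_mb_per_s(5e12, 8)
print(f"{rate:.1f} MB/s")  # prints "7.2 MB/s"
```

A rate in this range is far below what the described hardware can usually sustain, which is consistent with the observation that neither disk nor CPU appears saturated.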

 

The primary instance is on an even beefier machine with no load.  I tried diagnosing the slowdown; here is what I found and what I tried:

 - disk is not saturated, according to iostat

 - CPU is not saturated, according to htop

 - changed the instance type to increase/decrease RAM

 - experimented with maxIndexBuildMemoryUsageMegabytes via setParameter

 - experimented with replWriterThreadCount via setParameter
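For reference, the two parameters above are set in different ways. The following is a minimal sketch, not taken from the ticket: the helper names are hypothetical, and the values 1024 and 16 are illustrative only. It assumes, per the MongoDB documentation as best recalled, that maxIndexBuildMemoryUsageMegabytes can be changed at runtime with the setParameter admin command, while replWriterThreadCount can only be set at mongod startup.

```python
def runtime_set_parameter(name, value):
    """Build a setParameter admin command document (hypothetical helper).

    The command name must be the first key; Python dicts keep insertion order.
    """
    return {"setParameter": 1, name: value}


def startup_set_parameter(name, value):
    """Build mongod command-line arguments for a startup-only parameter (hypothetical helper)."""
    return ["--setParameter", f"{name}={value}"]


# Illustrative values only; they are not recommendations from the ticket.
index_cmd = runtime_set_parameter("maxIndexBuildMemoryUsageMegabytes", 1024)
writer_args = startup_set_parameter("replWriterThreadCount", 16)

# With pymongo installed and a reachable mongod, the runtime command could be sent as:
#   from pymongo import MongoClient
#   MongoClient("mongodb://localhost:27017").admin.command(index_cmd)
print(index_cmd)
print(" ".join(["mongod"] + writer_args))
```

Keeping the command construction separate from the connection makes it easy to log exactly which settings were in effect during each sync attempt.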

 

It seems like the instances could do a lot more, since neither disk nor CPU is saturated on either the primary or the secondary.  Is there anything else I can try?



 Comments   
Comment by Dmitry Agranat [ 12/Jul/20 ]

Hi oliver@sensortower.com,

Yes, if possible, please upload the data covering this process from the start, from both the syncing secondary and the primary. 200MB of diagnostic data should cover these 8 days. Just as an FYI, the "metric" file you've uploaded covers only 5 hours, so it's better to upload the whole archive of diagnostic.data.

From the 5 hours covered, we can see that setting replWriterThreadCount to 32 is making things worse. Checkpoint is not keeping up with the demand of replicating 23k write operations. Is it possible to gather all the requested information under the default configuration?

Thanks,
Dima

Comment by Oliver Yeh [ 08/Jul/20 ]

I uploaded what I could.  Some of the log files have been overwritten (apparently diagnostic.data only keeps 200MB?).  If that is not enough, we can close the issue and I can reopen it next time I do a full resync.  Thank you!

Comment by Dmitry Agranat [ 07/Jul/20 ]

Hi oliver@sensortower.com,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from both the syncing secondary and the primary, covering the time of the initial sync, and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
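The archiving step requested above could be scripted along these lines. This is only a sketch using Python's standard tarfile module: the function name, the paths in the usage comment, and the output file name are hypothetical and should be replaced with the actual dbpath and log locations on each host.

```python
import tarfile
from pathlib import Path


def archive_diagnostics(dbpath, log_files, out_path):
    """Bundle <dbpath>/diagnostic.data and the given mongod log files into a .tar.gz."""
    diag_dir = Path(dbpath) / "diagnostic.data"
    with tarfile.open(str(out_path), "w:gz") as tar:
        # Adding a directory includes its contents recursively.
        tar.add(str(diag_dir), arcname="diagnostic.data")
        for log in log_files:
            tar.add(str(log), arcname=Path(log).name)
    return out_path


# Hypothetical usage; run once on the syncing secondary and once on the primary:
#   archive_diagnostics("/var/lib/mongodb",
#                       ["/var/log/mongodb/mongod.log"],
#                       "secondary-diagnostics.tar.gz")
```

Creating one archive per host keeps the secondary and primary data clearly separated when uploading.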

Thanks,
Dima

Comment by Oliver Yeh [ 02/Jul/20 ]

4.2.8

Comment by Dmitry Agranat [ 02/Jul/20 ]

Hi oliver@sensortower.com,

What MongoDB version do you use during the initial sync?

Thanks,
Dima
