[SERVER-18041] Support parallel cloning during initial sync Created: 14/Apr/15 Updated: 08/Jan/24 |
|
| Status: | Investigating |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Crystal Horn | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 10 |
| Labels: | PM248, initialSync, pmr |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Replication |
| Case: | (copied to CRM) |
| Comments |
| Comment by Ramon Fernandez Marina [ 11/Mar/17 ] |
|
Hi Roy, Unfortunately initial sync is not resumable in 3.4 yet; I believe that work is defined in these tickets. If I'm not mistaken, there were 512 tickets related to replication in the 3.4 development cycle, 92 of which mention initial sync. While there are many ways that initial sync has been improved, I'm listing the highlights below:
In our internal sharded clusters, with live use and the balancer enabled, we've seen initial sync go from 5-7 days to a few hours. Hope this helps. Regards, |
| Comment by Roy Reznik [ 07/Mar/17 ] |
|
Hi Ramon, I watched that ticket. Roy. |
| Comment by Ramon Fernandez Marina [ 09/Nov/16 ] |
|
3.4 comes with faster, resumable initial sync. We're working on the documentation for these new features. We've also published three release candidates. 3.4.0-rc2 is the latest at the time of this writing, and you can download it and test these features. If you do any testing and find any issues, please open new SERVER tickets so we can investigate them. Thanks, |
| Comment by Roy Reznik [ 06/Nov/16 ] |
|
Is it still planned for 3.4? |
| Comment by Scott Hernandez (Inactive) [ 04/Jan/16 ] |
|
liranms, thanks for the pull request. I've added some comments there. Let's work on that until we have a plan, and then come back to jira for the next steps. dynamike, this has slipped from 3.2 as expected, but we are working hard on getting it into the 3.4 release – to replace both the cloner and the data (delta = oplog) replication process. We will have more time to discuss and understand the upstream consequences of increasing replication concurrency and how it will affect end users. The current plan is to support parallel copying at the collection level, so we can support databases with a lot of collections or a small number of collections in a lot of databases. There may also be support for resuming the cloning process if the initial sync is stopped (for example, due to a system shutdown), so that we clone only the missing collections. |
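The plan described in the comment above (copy collections in parallel, cap how many are cloned at once, and on restart clone only what is missing) can be illustrated with a minimal sketch. This is not the server's implementation: `clone_collection`, `parallel_clone`, and `max_concurrent` are hypothetical names, and plain in-memory dictionaries stand in for the sync source and the syncing node.

```python
from concurrent.futures import ThreadPoolExecutor


def clone_collection(source, target, namespace):
    """Hypothetical stand-in for copying one collection's documents."""
    target[namespace] = list(source[namespace])
    return namespace


def parallel_clone(source, target, max_concurrent=2):
    """Clone collections missing from `target`, at most `max_concurrent` at a time."""
    # Skip namespaces that are already present, so a restarted sync
    # copies only the collections that are still missing.
    pending = [ns for ns in sorted(source) if ns not in target]
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # list() waits for every clone and re-raises any worker exception.
        return list(pool.map(lambda ns: clone_collection(source, target, ns), pending))


# Assumed example data: one collection was cloned before an interruption.
source = {"db1.a": [1, 2], "db1.b": [3], "db2.c": []}
target = {"db1.a": [1, 2]}
cloned = parallel_clone(source, target, max_concurrent=2)
```

Keeping `max_concurrent` small by default reflects the concern raised elsewhere in this thread about overloading the sync source; a real implementation would also have to handle partially copied collections, index builds, and oplog application, none of which this sketch attempts.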
| Comment by Liran Moysi [ 24/Dec/15 ] |
|
It is extremely important to support parallel cloning, especially during the index build phase. Regarding the DOS attack that @scotthernandez mentioned, it's less relevant for the index building stage (which happens on the node itself), so parallelizing this stage would do no harm. |
| Comment by Michael Kania [ 18/Jun/15 ] |
|
Totally agree that keeping the default initial sync rate heavily limited and having the ability to dynamically tune it is the correct way to do it. Looking forward to the new Data Replicator stuff. |
| Comment by Scott Hernandez (Inactive) [ 18/Jun/15 ] |
|
It is not planned for 3.2 at this time. There are too many open questions about performance and the load it would create upstream on the sync source to schedule it now. In addition, there are currently no configurable options for initial sync, and without adaptive load monitoring/control, one would probably want, at a minimum, to control how many collections are cloned at once. We don't want to introduce a feature that can DOS attack other members in the replica set during initial sync – some people have actually seen problems with the current initial sync process causing performance degradation on live systems, since it can't be throttled. The good news is that the new Data Replicator components, which we will soon have internally, will allow us to support concurrent clones relatively easily when it is time. |
| Comment by Michael Kania [ 18/Jun/15 ] |
|
Is this planned for 3.2? |