[SERVER-24871] Add shard split feature (mitosis) Created: 01/Jul/16  Updated: 06/Dec/22  Resolved: 07/Jul/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Dai Shi Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding
Participants:

 Description   

Rationale – In a large sharded cluster with heavy insert/update load, balancing data across shards via chunk migration is extremely slow, especially in comparison with the rate of data ingestion. After adding new shards, it often takes months for the data to fully rebalance. Oftentimes the extra load/overhead of chunk migrations also causes the cluster to fail to serve actual traffic, making it impossible to keep the balancer running.

Even with the new parallel chunk migration strategy in 3.4, rebalancing to a single new shard will not be improved, since chunks can still only be moved to the new shard one at a time. This means that when operating such a cluster, you have to stay on top of planning and provisioning sufficient extra capacity months ahead of time. Otherwise you risk a situation where load or disk utilization on the existing shards is far too high, and it then takes months of running in a degraded state until the data balances and the pain is alleviated.

Proposed feature – Add a new admin command, similar to splitting a chunk, but instead of creating a new chunk it creates a new shard and updates the cluster metadata such that roughly half the chunks on the shard being split are now owned by the new shard.

How this might work in practice – Say you have a shard that is a replica set of 3 mongoDs. First you would add an additional 3 (or more) replicas to the replica set and wait for them to complete their initial sync. Then you can issue the new split shard command, designating which nodes will belong to the new shard being created, so you end up with two replica sets of 3 (or more) replicas each. The chunk metadata can then be updated such that half the chunks on the original shard belong to the newly created shard, and each shard can then delete the data in the chunks it no longer owns.
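
To make the shape of the proposal concrete, here is a minimal sketch of what issuing such a command from a driver might look like. The splitShard command name, its fields, and the host names are all hypothetical illustrations of this request; only the surrounding pymongo calls are real.

{code:python}
# Minimal sketch of the proposed workflow, run through pymongo against a mongos.
# The "splitShard" command and all of its fields are hypothetical (this feature
# does not exist); only the driver calls around it are real.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.net:27017")

# Precondition (existing functionality, not shown): three extra members have
# been added to the shard's replica set and have completed initial sync.

# Proposed command: carve those members out into a new shard and hand it
# roughly half of the original shard's chunks in the cluster metadata.
result = client.admin.command({
    "splitShard": "shard0001",              # shard to split (hypothetical field)
    "newShardName": "shard0001-split",      # name of the shard to create (hypothetical)
    "newShardMembers": [                    # members that will form the new replica set
        "node4.example.net:27018",
        "node5.example.net:27018",
        "node6.example.net:27018",
    ],
})
print(result)
{code}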

With this mechanism, adding a new shard that already has balanced data can be completed in seconds or minutes instead of months.



 Comments   
Comment by Dai Shi [ 19/Jul/16 ]

In my experience the sync phase of a chunk migration is still extremely slow compared to initial sync. On one of our largest clusters I have watched the sync phase of a single chunk move (just tailing the logs for "cloned" messages) take upwards of 4-5 minutes to move around 20MB of data. At this rate it would still take over 10 days just to do the sync, whereas on the same cluster the initial sync process takes around 10 hours. This is the case for a few of our clusters, and is why I am skeptical that chunk migrations will ever approach the speed of initial sync. Maybe this only affects our deployments? I would be curious to see how this compares in other deployments.

Comment by Andy Schwerin [ 14/Jul/16 ]

Those are interesting proposals. However, I still believe that an improved chunk migration would work at least as well, without requiring the development of a correct distributed protocol for the reconfigurations described above. My reasoning is as follows.

Chunk migration consists of a phase like initial sync, where documents are copied and writes are replicated from the original chunk-owning shard to the recipient. That phase is followed by a phase where metadata is updated to change ownership of the chunk, and finally a phase where the donor shard deletes documents it no longer owns. The mitosis proposals essentially boil down to doing the initial sync, updating the metadata, and then either not doing the deletes or deferring them into the future. On heavily loaded shards, those deletes are what cause the performance problem, so if we could adapt chunk migration to let those deletes be deferred, and improve the initial sync implementation used by chunk migration, we should see performance similar to initial sync for new replica set members.
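
To make the comparison concrete, here is a toy Python sketch (not MongoDB code; the function names are stand-ins for the real migration machinery) of the two phase orderings being contrasted:

{code:python}
# Toy sketch of the phase ordering discussed above. The functions are no-op
# stand-ins for the real migration machinery and only illustrate the idea.
import queue

def clone_and_catch_up(chunk, donor, recipient):
    """Initial-sync-like phase: copy documents, then replicate writes."""

def commit_metadata(chunk, recipient):
    """Update the cluster metadata so the recipient owns the chunk."""

def delete_range(donor, chunk):
    """Delete phase on the donor: the expensive part on a loaded shard."""

deferred_deletes = queue.Queue()  # cleanup work taken off the critical path

def migrate_chunk_today(chunk, donor, recipient):
    clone_and_catch_up(chunk, donor, recipient)
    commit_metadata(chunk, recipient)
    delete_range(donor, chunk)            # deletes done as part of the migration

def migrate_chunk_with_deferred_deletes(chunk, donor, recipient):
    clone_and_catch_up(chunk, donor, recipient)
    commit_metadata(chunk, recipient)
    deferred_deletes.put((donor, chunk))  # drained later, when the donor has headroom
{code}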

This still leaves the impact of new user traffic on the recipient shard after the first chunk arrives, but I suspect that this traffic matters less to the performance of the chunk migration than the cost of paging in and cleaning up cold indexes during post-move chunk data cleanup, and also that the extra traffic on the recipient corresponds to traffic removed from the donor (minus scatter-gather writes sent to all shards).

Comment by Dai Shi [ 08/Jul/16 ]

I am fairly certain there are ways to do this without any downtime, though it requires building some extra logic:

Option 1:
1) allow N new replicas to finish initial sync
2) stop balancer, issue split shard command to have the N new replicas form a new shard
a) at this point all writes remain on the original primary; the N new replicas will elect a primary among themselves
b) all replicas in the new shard continue to replicate from the original primary
3) one chunk at a time, update the chunk metadata to transfer ownership from the original shard to the new shard; stop replicating writes to those chunks from the original primary and start replicating from the new primary
4) continue until half the original chunks have been transferred
5) on both shards, delete the chunks they no longer own

Admittedly, this is kind of messy since nodes are replicating from 2 different sources, and is non-trivial to implement correctly.

Option 2:
1) allow N new replicas to finish initial sync
2) stop balancer, issue split shard command to have the N new replicas form a new shard
3) block writes until the new shard can elect a primary (this is no worse than stepping down a primary currently)
4) once a primary is elected, send all writes destined for the original shard to both the original primary and the newly elected primary
5) one at a time, update the chunk metadata to transfer ownership from the original shard to the new shard until half the chunks are owned by the new shard
6) switch writes back to normal, such that writes to a chunk only go to the shard that owns it
7) on both shards, delete the chunks they no longer own

This is much cleaner, as both shards retain the same copy of the data during the entire critical portion of the operation, and no changes need to be made on the mongoD side. There might be a slight increase in write latency since we have to confirm writes to both shards during the operation.
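
For illustration only, here is a rough sketch of what the per-chunk ownership hand-off (step 3 of option 1, step 5 of option 2) amounts to at the metadata level. The field names assume the 3.x config.chunks schema; a real implementation would have to drive this through the config servers' commit protocol with proper chunk versioning, not raw updates from a client:

{code:python}
# Illustration of the metadata change only: reassigning one chunk's owner.
# Assumes the 3.x config.chunks schema (ns/min/shard); a real implementation
# would go through the config server commit protocol with chunk versioning.
from pymongo import MongoClient

config = MongoClient("mongodb://configsvr.example.net:27019").config

def transfer_chunk(ns, chunk_min, donor, recipient):
    """Hand ownership of the chunk starting at chunk_min from donor to recipient."""
    return config.chunks.update_one(
        {"ns": ns, "min": chunk_min, "shard": donor},
        {"$set": {"shard": recipient}},
    )

# Example: hand one chunk of test.users from the original shard to the shard
# created by the split (all names are made up).
transfer_chunk("test.users", {"user_id": 1000}, "shard0001", "shard0001-split")
{code}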

I'm sure there are other clever ways to accomplish this as well; these are just a couple of examples. We have done similar kinds of operations that require tailing the oplog from one cluster and writing updates into another while copying the data to a new cluster, and then switching reads and then writes over, all without taking any downtime, so this definitely seems feasible.

As far as improvements to chunk migrations go – my take is that chunk migrations are a fundamentally different workload from initial sync. The node doing an initial sync does not have to serve any traffic and does not add any additional load to other nodes, except for some slight read load on the node it is syncing from, whereas a chunk migration adds significant additional load to every single replica on both the donor shard and the target shard, all of which are serving live traffic. The heavier the traffic load, the worse chunk migrations will perform. It is therefore very hard for me to believe that chunk migrations will ever approach the speed of initial sync, or its transparency in terms of performance impact. Perhaps you can elaborate on the improvements you are planning to make to the migration process?

Comment by Andy Schwerin [ 08/Jul/16 ]

You must disable writes to the cluster from the point just before you remove a mitosis candidate from the original shard until you finish updating the metadata to assign chunks to the new shard. Otherwise, writes to chunks eventually assigned to the new shard might be lost.

I think you are underestimating the room for improvement in the balancer. Balancing and initial sync should be nearly identical operations, minus the cleanup which could be deferred if the original shard has sufficient storage space.

Comment by Dai Shi [ 07/Jul/16 ]

Which part requires downtime? As far as I'm aware, none of what I described requires downtime.

Also, as of right now, chunk migrations produce a lot of extra stress on a running mongoD that initial syncs do not. Every time a chunk migration happens, the delete phase blows up the WT cache, and you also have to make a config metadata change. In many real, live production deployments you can't run the balancer at all, or can only run it during certain parts of the day. I have never heard of a deployment where an initial sync couldn't be done.

In our clusters, initial sync takes roughly 3-10 hours, whereas balancing to a newly created shard takes multiple months. I am skeptical that chunk migration can be improved to the degree that it would be as fast as an initial sync on a heavily loaded cluster.

Comment by Andy Schwerin [ 07/Jul/16 ]

Mitosis as described here requires system downtime, as do all variations that I'm aware of. As such, they would be better carried out by an orchestration tool than within MongoDB itself.

Also, if the proposed solution involves doing three initial syncs before the mitosis, I suspect that an equal amount of engineering effort applied to improving chunk migration would yield something just as fast, with no downtime required.

Comment by Kelsey Schubert [ 01/Jul/16 ]

Hi dai@foursquare.com,

Thank you for creating this feature request! I'm marking it to be scheduled during the next round of planning - please continue to watch for updates.

Kind regards,
Thomas
