[SERVER-77768] Prevent DDL ops and migrations from failing transitionFromDedicatedConfigServer
| Created: | 02/Jun/23 | Updated: | 24/Jan/24 |
| Status: | Needs Scheduling |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Kshitij Gupta |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | cs-subteam2 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Assigned Teams: | Cluster Scalability |
| Participants: |
| Description |
|
The transitionToDedicatedConfigServer command essentially wraps removeShard: once user data has been moved off the config server, it removes the config server's shard document from config.shards. To enable transitioning from a dedicated config server back to a config shard, the transitionFromDedicatedConfigServer command adds an entry back to config.shards, essentially wrapping addShard.

If a collection that exists locally on the config server conflicts with an existing namespace in the cluster, addShard, and therefore transitionFromDedicatedConfigServer, will fail, requiring the user to resolve the collision. To allow successive transitions, transitionToDedicatedConfigServer locally drops sharded collections that have been drained of their chunks, after all range deletion tasks have run.

Chunk migrations check whether the recipient shard is draining only when committing, so the balancer may choose to move a chunk to the config shard, and the config shard may then be successfully removed before the migration completes. The migration will correctly fail, but it may leave orphaned data on the config server, which prevents a future transitionFromDedicatedConfigServer from succeeding without user intervention. There is a similar problem with renameCollection.

This is unlikely in practice, for two reasons. First, the balancer won't move a chunk to a draining shard, so the config shard must have no chunks when the transition to dedicated mode begins. Second, removeShard waits for all local range deletion documents to be removed, so a migration started after the config shard starts to drain would have to take longer than the default orphan cleanup delay of 15 minutes to insert its range deletion task on the config server.

The purpose of this ticket is to guarantee that transitionFromDedicatedConfigServer can always succeed. Some possible approaches:
|