[SERVER-50027] Freeze migrations on original sharded collection while the resharding operation is in progress Created: 30/Jul/20 Updated: 29/Oct/23 Resolved: 11/Nov/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Blake Oler | Assignee: | Kaloian Manassiev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-234-M2, PM-234-T-lifecycle | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||
| Sprint: | Sharding 2020-09-21, Sharding 2020-10-05, Sharding 2020-10-19, Sharding 2020-11-02, Sharding 2020-11-16 | ||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||
| Description |
Add an optional isFrozen field for each collection in config.collections. |
| Comments |
| Comment by Githook User [ 11/Nov/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}Message: |
| Comment by Githook User [ 03/Nov/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}Message: ... and make it required |
| Comment by Githook User [ 02/Nov/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}Message: |
| Comment by Githook User [ 30/Oct/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}Message: This change gets rid of the writing of the 'dropped' field on the |
| Comment by Githook User [ 22/Oct/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}Message: This change just starts the basis so that the CollectionType can be |
| Comment by Haley Connelly [ 13/Oct/20 ] |
|
kaloian.manassiev it would be great if y'all could pick this up so we can focus on other resharding MS1 tickets! |
| Comment by Kaloian Manassiev [ 06/Oct/20 ] |
|
haley.connelly, since PM-1965 requires this functionality too, does it make sense that we pick it up? |
| Comment by Haley Connelly [ 05/Oct/20 ] |
|
setting aside until higher priority milestone 1 tickets are taken care of. |
| Comment by Kaloian Manassiev [ 10/Sep/20 ] |
|
I was envisioning that this command will not just disable the balancer thread (in fact it will not disable it). It would do exactly what you are suggesting, to add the noBalance field, but also it needs to join the current balancer round and also needs to join any currently ongoing manually issued moveChunk's for that collection. Basically it needs to serve as a fence that after it successfully returns, there are no moveChunk(s) running anywhere, for that collection. |
| Comment by Max Hirschhorn [ 10/Sep/20 ] |
I'm new to sharding conversations like this so I wasn't sure when people write "balancer" if they mean "chunk splits and migrations including those triggered explicitly by user commands" or if people mean "the background thread which runs on the config server primary". Disabling the balancer thread isn't sufficient to prevent all sharding metadata operations on the collection being resharded. Acquiring the distributed locks on the database and collection being resharding are likely necessary too. I was imagining resharding would do {$set: {noBalance: true}} to the config.collections document for the collection being resharded as part of the same transaction which inserts the document into the config.reshardingOperations collection. (And it would preserve the original noBalance setting so it can be restored after the resharding operation completes.) Setting the 'noBalance' field is a best-effort to make the balancer thread not do wasted work where it could somehow start a migration for the collection being resharded on one of the shards. No matter the noBalance setting, committing a chunk migration on the collection being resharded must be mutually exclusive with allowing the resharding operation to succeed. |
| Comment by Kaloian Manassiev [ 09/Sep/20 ] |
|
max.hirschhorn explained to me over Slack the correctness considerations why we need to disable balancing for a collection being resharded and I am fine with that (as much as I don't love it). However, may I ask that the ability to block balancing for a particular collection be implemented as a separate internal command, which can be tested in isolation from the resharding code and is not coupled with it? That way we can make sure that it properly joins ongoing migrations, etc and also it can be reused. I was imagining an internal _configsvrBalancerFreezeForCollection: "DB.Coll" or something like that. If the collection doesn't exist, it can return NamespaceNotFound. |
| Comment by Blake Oler [ 09/Sep/20 ] |
|
To be more specific – we necessitate disabling the balancer for both the source and destination collections, because the collection cloning and oplog fetching depend on the routing table not changing. Amusingly enough, the Slack conversation that I pulled the original context from notes that you didn't want the balancer to need to be disabled, but we had to do it anyway kaloian.manassiev |
| Comment by Blake Oler [ 09/Sep/20 ] |
|
Hey Kal, the balancer being disabled has been written as step 1.1 in the high-level design for a while now, and the current implementation relies on its being disabled. If you'd like to talk about this more, let haley.connelly and me know and we'll schedule a Zoom call. |
| Comment by Kaloian Manassiev [ 09/Sep/20 ] |
|
Would it be possible to not disable the balancer for neither the source nor the destination collection, in order to ensure that we adequately handle all cases where the versions change, throw StaleShardVersion, etc. I really don't want that we write the resharding cloning code in a way which relies on the balancer being stopped for correctness. The disabling of the balancer should be just for performance reasons. |