Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Sharding
Fully Compatible
Sharding 17 (07/15/16), Sharding 18 (08/05/16), Sharding 2016-08-29
If a moveChunk command fails for any reason, the migration is currently abandoned. We need to evaluate which kinds of errors should cause the migration to be retried. There must also be some kind of retry counter logic to make sure we don't reschedule a migration endlessly and never return from MigrationManager::scheduleMigrations.
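As a rough illustration of the retry counter idea, here is a minimal Python sketch (not the server's C++ code). The names `MoveResult`, `run_with_retries` and `MAX_ATTEMPTS` are hypothetical, and the cap of 3 is an assumed value that the ticket does not specify. The point is only that a bounded attempt count guarantees the scheduling loop returns.

```python
from enum import Enum, auto


class MoveResult(Enum):
    """Hypothetical outcome categories for one moveChunk attempt."""
    OK = auto()
    RETRIABLE = auto()      # e.g. a network error
    NON_RETRIABLE = auto()  # anything we decide not to retry


MAX_ATTEMPTS = 3  # assumed cap; the ticket does not give a number


def run_with_retries(attempt_move, max_attempts=MAX_ATTEMPTS):
    """Call attempt_move() until it stops being retriable or the cap is hit.

    Returns (final_result, attempts_made). The loop always terminates
    because attempts_made can never exceed max_attempts.
    """
    attempts = 0
    result = MoveResult.NON_RETRIABLE
    while attempts < max_attempts:
        attempts += 1
        result = attempt_move()
        if result is not MoveResult.RETRIABLE:
            break
    return result, attempts
```

A caller that still sees `MoveResult.RETRIABLE` after the last attempt would abandon the migration, as happens today on the first failure.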
A few specific things to think through:
moveChunk command errors in MigrationManager::_checkMigrationCallback
- should retry on network errors
- should retry on conflicting migration errors. We should really maintain a map of shards performing migrations initiated by the balancer, so that we can tell whether a conflict arises because we already scheduled a migration with that shard or has an external cause, and so that we don't schedule the migration in the first place if we know it won't work.
- LockBusy errors when we already know it's an old 3.2 shard (second LockBusy error on moveChunk) – do we even want to reschedule?
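The map of shards mentioned above could look roughly like the following Python sketch. `ActiveMigrations` and its methods are hypothetical names, and the sketch assumes a shard takes part in at most one balancer-initiated migration at a time. Under that assumption, a migration that was successfully reserved here and still hits a conflicting migration error would point to an external cause.

```python
class ActiveMigrations:
    """Hypothetical registry of shards with a balancer-initiated migration in flight."""

    def __init__(self):
        self._by_shard = {}  # shard id -> chunk being migrated

    def is_busy(self, shard):
        """True if the balancer already scheduled a migration involving this shard."""
        return shard in self._by_shard

    def try_reserve(self, from_shard, to_shard, chunk):
        """Reserve both shards; refuse if either is already in one of our migrations."""
        if self.is_busy(from_shard) or self.is_busy(to_shard):
            return False  # scheduling now would only produce a conflict error
        self._by_shard[from_shard] = chunk
        self._by_shard[to_shard] = chunk
        return True

    def release(self, from_shard, to_shard):
        """Forget both shards once the migration completes or is abandoned."""
        self._by_shard.pop(from_shard, None)
        self._by_shard.pop(to_shard, None)
```

Checking `try_reserve` before sending moveChunk avoids scheduling a migration that is already known not to work.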
MigrationManager::_executeMigrations
- scheduleRemoteCommand errors (callbackhandle check)
has to be done after SERVER-24853 Refactor Balancer code to use MigrationManager in order to move chunks in parallel (Closed)