The proposed fix is for the migration manager to attempt to acquire the "balancer" distributed lock as its first action after entering the recovering state. This acquisition must use a local write concern, because at that point the server is still in draining mode.
If this acquisition fails for any reason, there are two cases to consider:
- There are no migration documents - this is harmless, since there are no active migrations to recover. The failure most likely happened because, after an upgrade, a 3.2 mongos still held a lease on the lock. That lease expires after 15 minutes, at which point the balancer thread will acquire the lock; until then, holding the lease guarantees that no new migrations will occur.
- There are some migration documents - this is the unexpected case and can only happen due to a local write problem, which is a severe condition in itself. In this case, just log a warning and continue recovery, since there is not much else that can be done.
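The decision logic above can be sketched as follows. This is an illustrative sketch only: the function and parameter names (`recover_migrations`, `acquire_balancer_lock`, `load_migration_docs`, `log_warning`) are hypothetical and do not correspond to actual MongoDB server APIs.

```python
def recover_migrations(acquire_balancer_lock, load_migration_docs, log_warning):
    """Hypothetical sketch of recovery-time balancer lock handling.

    acquire_balancer_lock: callable returning True on success; per the text,
        the real acquisition must use a local write concern because the
        server is still draining at this point.
    load_migration_docs: callable returning persisted migration documents.
    log_warning: callable taking a warning message.
    Returns the list of migration documents to recover.
    """
    if acquire_balancer_lock():
        # Normal path: recover any pending migrations under the lock.
        return load_migration_docs()

    docs = load_migration_docs()
    if not docs:
        # Harmless case: nothing to recover. The lock was most likely leased
        # by a 3.2 mongos after an upgrade; the lease expires after 15
        # minutes and the balancer thread then acquires the lock, so no new
        # migrations start in the meantime.
        return []

    # Unexpected case: acquisition failure with migration documents present
    # can only stem from a local write problem. Warn and continue recovery.
    log_warning("Failed to acquire balancer lock; continuing migration recovery")
    return docs
```

The key design point is that the failure path never aborts recovery: the no-documents case is self-resolving, and the documents-present case is logged but tolerated because refusing to recover would only make a severe local condition worse.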