Priority: Major - P3
Resolution: Gone away
Affects Version/s: 3.0.4
Fix Version/s: None
Environment: Reproduced with Mongo 3.0.4 WT inside Docker and TCL 6.3
Steps To Reproduce:
The issue can be reproduced using the Docker Compose spec and script attached to this ticket.
A public Gist of the same files is available here: https://gist.github.com/thieman/e28d426d3415c30caf38
Under the following conditions, the balancer can cause up to N mongod primary failures, where N is the number of collections currently eligible for balancing.
1. The balancer tries to move a chunk
2. During the moveChunk, the FROM primary enters the critical section
3. During the critical section, the FROM primary is partitioned from the first config server
4. The FROM primary crashes
5. If the balancing mongos is still able to reach the first config server, it will release its balancing lock
6. If step 5 happened, go back to step 1, where the balancer moves on to another eligible collection
After step 4, the collection lock will still be held on the collection that was being balanced at the time of the crash. This lock will expire after 24 hours, but during that time the collection is no longer eligible for balancing.
Critically, this failure cannot cascade to multiple nodes if the first config server is also partitioned from the balancing mongos. When this happens, the balancing mongos is unable to release its balancing lock, so balancing cannot continue. This is a good thing, as it prevents cascading primary failures.
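The loop described above can be illustrated with a small toy model. This is only a sketch of the reasoning in this ticket, not MongoDB code: the function name, its parameters, and the in-memory bookkeeping are all hypothetical, and it assumes the partition between each FROM primary and the first config server persists for every attempted migration.

```python
def simulate_cascade(eligible_collections, mongos_reaches_config):
    """Toy model of steps 1-6 (hypothetical, not MongoDB internals).

    Returns (crashed_primaries, stuck_collection_locks).
    """
    crashed_primaries = 0
    stuck_collection_locks = set()
    balancing_lock_held = False

    for collection in eligible_collections:
        if balancing_lock_held:
            # The balancing lock was never released, so balancing cannot continue.
            break

        # Step 1: the balancer takes its lock and starts a moveChunk.
        balancing_lock_held = True
        # Steps 2-4: the FROM primary is partitioned during the critical
        # section and crashes, leaving the collection lock behind.
        stuck_collection_locks.add(collection)
        crashed_primaries += 1

        # Step 5: the balancing lock is released only if the mongos can
        # still reach the first config server.
        if mongos_reaches_config:
            balancing_lock_held = False

    return crashed_primaries, stuck_collection_locks


if __name__ == "__main__":
    collections = ["db.a", "db.b", "db.c"]
    print(simulate_cascade(collections, mongos_reaches_config=True))
    print(simulate_cascade(collections, mongos_reaches_config=False))
```

In this model, a mongos that can still reach the config server crashes one primary per eligible collection, while a mongos that is also partitioned stops after the first crash.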
Since the cascade can only happen while the balancing mongos is still able to read from and write to the first config server, it seems like a more careful reading of the current lock state by that mongos would allow us to prevent it. Perhaps something like this?
1. Balancing mongos realizes the migration has failed
2. Balancing mongos checks the collection lock on the collection it was just trying to migrate
3. If that lock is still held, mongos can reason that something has gone wrong and refuse to relinquish its balancing lock as a precaution
If the balancing mongos did this, it would prevent the cascading failure and give us 24 hours to respond to the issue before any other primaries could die.
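The proposed check could look roughly like the following. This is a minimal sketch of the decision only: the function name and its arguments are hypothetical, and the set of held collection locks stands in for whatever the mongos would actually read from the config server.

```python
def should_release_balancing_lock(migration_failed, held_collection_locks, namespace):
    """Decide whether the balancing mongos should release its balancing lock.

    held_collection_locks is a hypothetical set of namespaces whose
    collection lock is currently held, as read from the config server.
    """
    if not migration_failed:
        # Normal case: the migration finished, release as usual.
        return True
    # Steps 2-3 of the proposal: if the collection lock for the namespace
    # that was being migrated is still held, something has gone wrong,
    # so keep the balancing lock as a precaution.
    return namespace not in held_collection_locks


if __name__ == "__main__":
    held = {"db.a"}
    print(should_release_balancing_lock(True, held, "db.a"))   # lock still held
    print(should_release_balancing_lock(True, set(), "db.a"))  # lock was released
```

Keeping the balancing lock in the suspicious case trades balancing availability for safety, which matches the behavior already seen when the mongos is itself partitioned from the first config server.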