Details

Type: Improvement
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 1.6.5
Component/s: None
Environment: Replica set on Ubuntu 10.04 on Amazon EC2 nodes with sets of EBS volumes.
Description
Today's (mostly EBS-related) outage in Amazon AWS caused the same failure twice in our replica set.
The root cause was that the EBS volumes, which were assembled with mdadm and mounted via LVM, became unavailable. I can see that the kernel probably leaves the Mongo server unaware of, or guessing about, what is going on here, but what happened was that:
1) the master didn't step down, probably because its network was fine and it wasn't lagging;
2) all clients kept connecting to the node that didn't work.
Would there be any way for the master, or the slaves, to detect this situation and fail over?
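For what it's worth, here is a minimal watchdog sketch of the kind of detection I have in mind, in Python with pymongo: it probes the data partition with a small synced write and, if the write fails, asks the local mongod to relinquish primacy via the replSetStepDown command. The dbpath, port, and intervals are assumptions about a deployment like ours, and it only helps as long as mongod still accepts network connections.

import os
import time

from pymongo import MongoClient
from pymongo.errors import AutoReconnect, PyMongoError

DBPATH = "/data/db"          # assumed dbpath on the EBS-backed volume
PROBE = os.path.join(DBPATH, ".disk-probe")
CHECK_INTERVAL = 10          # seconds between probes


def disk_is_writable():
    """Try a tiny synced write on the data partition."""
    try:
        with open(PROBE, "w") as f:
            f.write(str(time.time()))
            f.flush()
            os.fsync(f.fileno())
        return True
    except OSError:
        return False


def step_down():
    """Ask the local primary to relinquish its role for 120 seconds."""
    client = MongoClient("localhost", 27017)
    try:
        client.admin.command("replSetStepDown", 120)
    except AutoReconnect:
        pass  # the primary closes connections when it steps down
    except PyMongoError as exc:
        print("step down failed: %s" % exc)


if __name__ == "__main__":
    while True:
        if not disk_is_writable():
            step_down()
        time.sleep(CHECK_INTERVAL)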
Besides that, I am not aware of any option to have a slave step up manually instead of having the master step down. In the above scenario, the master didn't allow Mongo shell access because of the bad data partition, leaving no way to tell the master to step down other than powering it off.
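The only last-resort alternative to powering the box off that I can think of is something like the sketch below: isolate the stuck primary at the network level so the remaining members lose contact with it and elect a new primary. It assumes SSH to the host still works even though the shell couldn't get in, that mongod listens on the default port 27017, and that the script runs as root.

import subprocess

MONGOD_PORT = "27017"  # assumed default mongod port


def isolate_mongod():
    """Drop all inbound and outbound traffic on the mongod port."""
    for rule in (
        ["iptables", "-A", "INPUT", "-p", "tcp",
         "--dport", MONGOD_PORT, "-j", "DROP"],
        ["iptables", "-A", "OUTPUT", "-p", "tcp",
         "--sport", MONGOD_PORT, "-j", "DROP"],
    ):
        subprocess.check_call(rule)


if __name__ == "__main__":
    isolate_mongod()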
Cheers