[SERVER-2978] Stepping down when (network) storage is unavailable Created: 21/Apr/11 Updated: 30/Mar/12 Resolved: 02/Sep/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 1.6.5 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Pieter Ennes | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Replica set on Ubuntu 10.04 on Amazon EC2 nodes with sets of EBS volumes. |
||
| Participants: |
| Description |
|
Today's (mostly EBS-related) outage in Amazon AWS caused the same effect twice in our replica set. The root cause was that the EBS volumes, which were mounted via mdadm and LVM, became unavailable. I can see that the kernel probably leaves the Mongo server unaware of, or guessing about, what's going on, but what happened was that:

1) the master didn't step down, probably because its network was fine, and it didn't lag.

Would there be any way for the master or the slaves to detect this situation and fail over?

Besides that, I am not aware of any option to have a slave manually step up, instead of having the master step down. In the above scenario, the master didn't allow Mongo shell access because of the bad data partition, leaving no way to tell the master to step down other than powering it off.

Cheers |
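To illustrate the kind of detection asked about above, here is a minimal sketch of an external watchdog probe, not something MongoDB provides. It assumes that a hung volume shows up as a small write plus fsync that either fails or never returns. The names `write_probe` and `storage_is_healthy`, the 5-second timeout, and the probe file prefix are all hypothetical.

```python
import os
import tempfile
import threading


def write_probe(data_dir):
    """Write and fsync a small file in data_dir, then remove it."""
    fd, path = tempfile.mkstemp(prefix=".probe-", dir=data_dir)
    try:
        os.write(fd, b"ok")
        os.fsync(fd)  # force the write down to the (possibly hung) volume
    finally:
        os.close(fd)
        os.unlink(path)


def storage_is_healthy(data_dir, timeout_s=5.0, probe=write_probe):
    """Return True only if probe(data_dir) finishes without an OSError
    within timeout_s seconds (hypothetical helper, not a MongoDB API)."""
    result = {"ok": False}

    def run():
        try:
            probe(data_dir)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Run the probe in a daemon thread so a blocked write cannot stall the caller.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]
```

What to do when the probe fails is deliberately left out. As the report notes, the affected master did not accept shell connections, so asking it to step down may not work, and stopping the mongod process may be the only option left to such a watchdog. A thread blocked on a hung device may also linger until the I/O completes.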
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 22/Apr/11 ] |
|
stepUp doesn't really make sense: to be safe, it would have to tell the other server to stepDown, which it sounds like it wouldn't be able to do. |
| Comment by Pieter Ennes [ 21/Apr/11 ] |
|
Hard to say at the moment, as we had to power down the two previous master nodes to get fail-over, and Amazon is still working on recovery; I will check when they come back. Is there any argument in favour of adding a stepUp function next to stepDown? |
| Comment by Eliot Horowitz (Inactive) [ 21/Apr/11 ] |
|
Were there errors in the mongo log or was it just hanging? |