[SERVER-2978] Stepping down when (network) storage is unavailable Created: 21/Apr/11  Updated: 30/Mar/12  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Pieter Ennes Assignee: Unassigned
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Replica set on Ubuntu 10.04 on Amazon EC2 nodes with sets of EBS volumes.


Participants:

 Description   

Today's (mostly EBS related) outage in Amazon AWS caused the same effect twice in our replica set.

The root cause was that the EBS volumes, which were mounted via mdadm and LVM, became unavailable. I can see that the kernel probably leaves the Mongo server unaware of, or guessing about, what's going on, but what happened was that:

1) the master didn't step down, probably because its network was fine, and it didn't lag
2) all clients kept connecting to the node that didn't work

Would there be any way for the master or the slaves to detect this situation and fail over?

Next to that, I am not aware of any option to have a slave manually step up either, instead of having the master step down. In the above scenario, the master didn't allow Mongo shell access because of the bad data partition, leaving no way to tell the master to step down, other than powering it off.

Cheers



 Comments   
Comment by Eliot Horowitz (Inactive) [ 22/Apr/11 ]

stepUp doesn't really make sense: to be safe, it would have to tell the other server to stepDown, which it sounds like it wouldn't be able to do.

Comment by Pieter Ennes [ 21/Apr/11 ]

Hard to say a.t.m. as we had to power down the two previous master nodes to get fail-over, and Amazon is still working on recovery; will check when they come back.

Is there anything in favour of adding a stepUp function next to stepDown?

Comment by Eliot Horowitz (Inactive) [ 21/Apr/11 ]

Were there errors in the mongo log or was it just hanging?
If there were actual file system errors, we could trigger off of that.
If it's just hanging, it's pretty tricky to say for sure what to do.
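
The distinction between a file system error and a hang can be illustrated with a probe that has a deadline: a small write plus fsync either succeeds, fails with an error, or does not return in time. The following is a minimal Python sketch of that idea, not anything MongoDB does. The names `write_probe` and `check_storage` and the five-second timeout are hypothetical, and what to do on a "hang" result (for example, whether to step down) is left open, as in the discussion above.

```python
import os
import tempfile
import threading


def write_probe(data_dir):
    """Write and fsync a small temporary file in data_dir, then delete it."""
    fd, path = tempfile.mkstemp(prefix=".storage-probe-", dir=data_dir)
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # force the write down to the underlying volume
    finally:
        os.close(fd)
        os.unlink(path)


def check_storage(data_dir, timeout_s=5.0, probe=write_probe):
    """Return "ok", "error", or "hang" for the storage behind data_dir.

    Assumption: a probe that has not finished within timeout_s counts as a hang.
    """
    result = {}

    def run():
        try:
            probe(data_dir)
            result["status"] = "ok"
        except OSError as exc:
            # An actual file system error: the case that is easy to act on.
            result["status"] = "error"
            result["detail"] = str(exc)

    # Daemon thread: a probe stuck in blocked I/O must not keep the process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "hang"
    return result.get("status", "error")
```

A thread blocked in I/O on an unavailable volume cannot be cancelled from Python, so the sketch simply abandons it and reports "hang". That only shows the storage did not answer within the deadline; it does not show whether the volume will recover, which is why the hang case remains hard to decide on.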
