[SERVER-9081] Can't failover last SECONDARY node with Replica set. Created: 22/Mar/13 Updated: 10/Dec/14 Resolved: 02/Apr/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Paul DeCoursey | Assignee: | Nicholas Tang |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | create a 3 member replica set, shut down two of the members, and observe that the remaining secondary never becomes primary (the set stays read-only). |
| Participants: | |
| Description |
|
this is a dupe of "Give us a way to instruct a member to be primary without a majority. Let us be the arbiter." |
| Comments |
| Comment by Nicholas Tang [ 02/Apr/13 ] |
|
Paul,

I'm going to close this out, as it is working as designed. Ultimately, the way I described things working is the only way to ensure automatic data consistency in the case of any network partition or failure; having a primary elected without a majority would allow the chance of multiple primaries, meaning inconsistency across the replica set. If you have the time, this article is worth reading, as it goes through a detailed explanation much better than I can:

The quick summary, again, is that MongoDB chooses consistency over (write) availability, which is why this happens. It's not possible to have both, and changing that behavior would require a large set of changes (not to mention potentially breaking the expected behavior for tens of thousands of clusters). I know that's probably not a satisfying answer, but it is the real one.

Thanks, |
| Comment by Paul DeCoursey [ 22/Mar/13 ] |
|
So what you are saying is that if I lose a majority of nodes at any one time, the cluster goes into read-only mode? And you are also saying we can't configure it to work in a more reasonable way? I want my servers to operate like people, not politicians. |
| Comment by Nicholas Tang [ 22/Mar/13 ] |
|
Paul,

Elections look for a majority of all possible nodes, not just local nodes. In the scenario you described, the only location that would have a majority of votes would be the location with three nodes (3/5 > 1/2), which would elect a primary; the other location would go into read-only mode (2/5 < 1/2). When the network partition was resolved, the secondaries in the 2-node location would proceed to resync with the primary. This still makes sure that we have data consistency across the cluster as a whole, if you use the write concern of majority in your app/driver (more here).

MongoDB's replication was designed to make sure that MongoDB wouldn't destroy or lose data. In order to manage that, we have to assume the worst when there are various possibilities of failure. That's why we provide the force option, to allow the operator to make a potentially dangerous configuration that they have determined is worth the risk (or that has no risk, as the case may be with the advantage of information the nodes don't have).

Thanks, |
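For illustration, a minimal mongo shell sketch of a write using the majority write concern Nicholas refers to (the collection name and document are made up, and the per-operation writeConcern syntax assumes a reasonably recent shell/driver):

    // Ask the server to acknowledge the write only once a majority of
    // replica set members have applied it; wtimeout (in ms) makes the
    // call return with an error instead of blocking indefinitely when
    // no majority is reachable.
    db.orders.insert(
        { status: "pending" },
        { writeConcern: { w: "majority", wtimeout: 5000 } }
    )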
| Comment by Paul DeCoursey [ 22/Mar/13 ] |
|
You are solving for a problem that may or may not even exist. The scenario can still occur if, for instance, I had a 5 member replica set and the network split left two on one side and three on the other. In that case the failover would be successful for both segments. |
| Comment by Nicholas Tang [ 22/Mar/13 ] |
|
Paul,

We don't automatically have a node take over when it can't get a majority of votes, and there's a good reason for that (which I'll get into momentarily). However, if you want to override that, you can do so with rs.reconfig() using the force option. Details on that command are here. We have a tutorial on forcing this reconfiguration here that explains it step by step. If you use a centralized system for configuration management/orchestration, you could script this to happen automatically. Even without electing a new primary, though, the remaining node will still allow read requests, which can be configured at the app/driver level. That's explained in more detail here and would allow your application to go into read-only mode in those scenarios until the majority could be restored (or forced).

Now, as to why we have that behavior: when two of the three nodes in a replica set go offline because two of the hosts are confirmed down, you're absolutely correct that there's no reason why the third member shouldn't just take over. The problem is when you have a network split (or partition) - for instance, if two nodes are in one location and the third is in a second location. It's entirely possible to have a network issue that allows your application servers to see all three nodes, but prevents the nodes at one location from seeing the other(s) (for instance, if you have app servers in both locations and GSLB (aka global, or geo, server load balancing) sending requests to both sets of app servers). In those scenarios, if we didn't require a majority for election, you'd end up with two primaries: one in location A and one in location B, both accepting writes, and both with datasets now diverging.

Unfortunately, the nodes in the replica set have no way of confirming whether the other hosts are actually down or just inaccessible, and that's the key to this - if they could determine definitively that the other nodes were down, they could theoretically elect a new primary even without a majority. Requiring the majority prevents that, full stop. It also makes for more challenging configuration, as you need to be aware of the full ramifications of various outage/problem scenarios when designing the layout of your replica set - but it allows us to keep data consistency when that happens.

Does that make sense? I'm happy to answer any questions or give you more detail if you'd like it.

Thanks, |
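As a rough mongo shell sketch of the two options Nicholas describes (the member index and read preference value are illustrative placeholders, not a prescribed procedure; a forced reconfig carries the divergence/rollback risks discussed in this thread):

    // Option 1: on the surviving member, drop the unreachable members
    // from the config and force the reconfiguration so the remaining
    // member can become primary.
    cfg = rs.conf()
    cfg.members = [cfg.members[2]]    // keep only the reachable member (index is illustrative)
    rs.reconfig(cfg, { force: true })

    // Option 2: leave the set read-only and let the application read
    // from the remaining secondary (per-connection, in the shell):
    rs.slaveOk()
    db.getMongo().setReadPref("secondaryPreferred")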