[SERVER-18453] Avoiding Rollbacks in new Raft based election protocol Created: 13/May/15 Updated: 06/Dec/22 Resolved: 06/Sep/16 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Henrik Ingo (Inactive) | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Replication |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Background

The current election protocol gives preference to the node that has the most recent opTime in its oplog: out of the pool of "electable" nodes, the one with the most advanced opTime wins the election. The rationale is to minimize the amount of data rolled back after a failover. The Raft protocol does not have such a component in its election protocol, and we do not want to re-introduce it, because we could then no longer rely on the robustness/correctness of the Raft protocol in our implementation. In fact, this part of the current election protocol has been a source of bugs (e.g. no master elected). I agree with this design.

The result is that in MongoDB 3.2, users will be much more likely to see data rolled back that was written with write concern < majority. Strictly speaking this is fine, as we have never guaranteed that such writes would be safe across a failover. In practice, however, users can today run with w:2 or even w:1 and expect to lose a minimal amount of data on failovers, whereas in MongoDB 3.2 they could suddenly lose hundreds of milliseconds of transactions; this is arguably a regression in our capability.

Example test case

(Note: this is based on my understanding of our current election protocol; I have not actually verified it. If we want to address this problem at all, the first step would be to run this test. If it does not trigger the behavior I describe for 3.2, it will nevertheless be possible to construct a more contrived test case that does.)

write concern = 2
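A rough sketch of how such a test might look in the mongo shell. The five-member, two-datacenter topology (three North American members including the primary, two European members) is inferred from the discussion below; the host names, the sample write, and the kill step are illustrative assumptions, not part of the original setup:

```javascript
// Hypothetical 5-node replica set: 3 members in NA (one becomes primary),
// 2 members in EU that lag the NA members by ~100 ms of replication.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "na1.example.net:27017" },
    { _id: 1, host: "na2.example.net:27017" },
    { _id: 2, host: "na3.example.net:27017" },
    { _id: 3, host: "eu1.example.net:27017" },
    { _id: 4, host: "eu2.example.net:27017" }
  ]
});

// Drive writes at the NA primary with write concern w:2, which is satisfied
// as soon as one NA secondary acknowledges -- before the EU members catch up.
db.test.insert({ x: 1 }, { writeConcern: { w: 2 } });

// Then kill the primary (e.g. kill -9 the na1 mongod) and observe which node
// wins the election and whether the NA secondaries roll back any oplog entries.
```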
In MongoDB 2.4-3.0, a primary failure is likely to cause one of the other NA secondaries to become primary, because they will of course be 100 ms ahead of their EU peers in replication. In the proposed MongoDB 3.2 design, all secondaries have an equal probability of becoming primary. There is therefore a 50% chance that an EU node becomes primary, in which case the NA secondaries would roll back 100 ms of their oplog.

Proposal

I believe it is possible to give more attention to not rolling back the oplog, as follows:
Benefits of my proposal:
Drawbacks of my proposal:
Proposed priority

3.1 Desired or less. The justification is that, per our documentation, we do not promise that non-majority-committed data will not be rolled back. |
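For reference, the distinction the description relies on can be expressed directly in the write concern of each operation; a small mongo shell illustration (the collection and document are hypothetical):

```javascript
// A write acknowledged by a majority of the replica set's members will not be
// rolled back by a failover once the acknowledgement has been returned.
db.orders.insert({ sku: "abc", qty: 1 },
                 { writeConcern: { w: "majority", wtimeout: 5000 } });

// A write acknowledged by only two members may be rolled back if the
// acknowledging members end up behind whichever node wins the next election.
db.orders.insert({ sku: "abc", qty: 1 }, { writeConcern: { w: 2 } });
```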
| Comments |
| Comment by Eric Milkie [ 06/Sep/16 ] |
This idea was implemented as part of the work for |
| Comment by Henrik Ingo (Inactive) [ 15/Apr/16 ] |
I like this thinking. Makes sense. And yes, this is of course very unlikely to happen, so not suggesting we optimize for it, just that it is handled correctly one way or another. |
| Comment by Henrik Ingo (Inactive) [ 14/Apr/16 ] |
Scratch that. Now that I'm re-reading and refreshing my memory, what you say is correct.
The proposal surely addresses a corner case (at best... I must emphasize I didn't actually test any of this). For better or worse, I live with a mind that easily spots corner cases. |
| Comment by Andy Schwerin [ 12/Apr/16 ] |
I just re-read the problem description in this ticket, and I think the example is a little misleading. For one thing, if the two North American nodes are truly 100 ms ahead of the two European nodes, they won't vote for either of the European nodes, denying them the majority required to win the election.

The real difference between the old and new protocols is what happens when exactly one of the surviving nodes has a write that none of the other nodes have, and this condition survives until some node's election timeout expires. In the original election protocol (selected via protocolVersion: 0 in the replica set configuration document in MongoDB 3.2 and later), a node participating in an election may veto a candidate if it believes that it or some third node is up and has an operation in its oplog that the candidate does not have. In the new election protocol (protocolVersion: 1), a node can only refrain from voting for a candidate, and then only if it believes that it has a newer operation in its own oplog than the candidate.

However, during the time between the old primary crashing and the candidate standing for election, nodes continue to fetch operations from each other's oplogs, improving the likelihood that the newest operations end up in a majority of nodes' oplogs. I suspect that in practice this lowers the odds of losing the write that is initially present in only one secondary when the original primary first crashes.

The proposal is still interesting, as it could provide a useful knob for a user to balance the likelihood of losing w:2 writes in 5-node sets against the amount of time that must pass before the replica set becomes available for writes. |
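The protocol selection referred to above lives in the replica set configuration document; a brief mongo shell sketch of inspecting it and moving a set to the new protocol (only the protocolVersion field is taken from the comment, the rest is a generic reconfig pattern):

```javascript
// Check which election protocol the set is currently running:
// 0 = original protocol, 1 = new (Raft-based) protocol.
// An absent field means the set is still on protocol version 0.
rs.conf().protocolVersion;

// Move an existing set to the new election protocol (MongoDB 3.2+).
var cfg = rs.conf();
cfg.protocolVersion = 1;
rs.reconfig(cfg);
```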