[SERVER-18453] Avoiding Rollbacks in new Raft based election protocol Created: 13/May/15  Updated: 06/Dec/22  Resolved: 06/Sep/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Henrik Ingo (Inactive) Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-23663 New primary syncs from chosen node to... Closed
is duplicated by SERVER-22502 Replication Protocol 1 rollbacks are ... Closed
Related
related to SERVER-14539 Full consensus arbiter (i.e. uses an ... Backlog
related to SERVER-11086 Election handoff to new primary, duri... Closed
Assigned Teams:
Replication
Participants:
Case:

 Description   

Background

The current election protocol gives preference to the node that has the most recent opTime in its oplog. Essentially, out of the pool of "electable" nodes, the one with the most advanced opTime wins the election. The rationale for this is to minimize the amount of data rolled back due to a failover.
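
For illustration, the old preference can be modelled roughly like this (a simplified sketch with optimes reduced to plain integers, not the actual server code):

    # Simplified model of the pre-3.2 preference for the freshest node.
    # Illustrative only; real optimes are timestamps, not integers.
    from collections import namedtuple

    Node = namedtuple("Node", ["name", "optime", "electable"])

    def pick_preferred_candidate(nodes):
        """Among electable nodes, prefer the one with the most advanced optime."""
        electable = [n for n in nodes if n.electable]
        return max(electable, key=lambda n: n.optime) if electable else None

    nodes = [
        Node("na-1", optime=1000, electable=True),
        Node("na-2", optime=1000, electable=True),
        Node("eu-1", optime=900, electable=True),   # ~100 ms behind the NA nodes
        Node("eu-2", optime=900, electable=True),
    ]
    print(pick_preferred_candidate(nodes).name)  # an NA node wins: freshest optime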

The Raft protocol has no such component in its election rules. We do not want to re-introduce it, because then we could no longer rely on the robustness/correctness of the Raft protocol in our implementation. In fact, it seems this part of the current election protocol has been a source of bugs (e.g. no master elected, etc...). I agree with this design.

The result is that in MongoDB 3.2, users will be much more likely to see data rolled back that was written with write concern < majority. Strictly speaking this is ok, as we have never guaranteed that such writes would be safe during failover. From a practical point of view, however, users can today run with w:2 or even w:1 and expect to lose a minimal amount of data on failovers, whereas in MongoDB 3.2 they could suddenly lose hundreds of milliseconds' worth of transactions, which is arguably a regression in our capabilities.

Example test case

(Note: this is based on my understanding of our current election protocol; I haven't actually verified it so far. If we want to address this problem at all, the first step would be to run this test. If this test doesn't trigger the behavior I'm describing for 3.2, it will nevertheless be possible to construct a more contrived test case that will.)

write concern = 2

  • Primary in North America
  • 2 secondaries in North America
  • 2 secondaries in Europe
  • All nodes have equal/default priority
  • Network RTT Europe-NA is roughly 100 ms

In MongoDB 2.4-3.0, a primary failure is likely to cause one of the other NA secondaries to become primary, because they will of course be 100 ms ahead of their EU peers in replication.

In the proposed MongoDB 3.2 design, all secondaries have an equal probability of becoming primary. Therefore there is a 50% chance that an EU node becomes primary, in which case the NA secondaries would roll back 100 ms of their oplog.
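
For concreteness, the topology above could be set up along these lines (a minimal sketch using pymongo; hostnames and the replica set name are placeholders I made up):

    # Sketch of the 5-node topology described above.
    from pymongo import MongoClient, WriteConcern

    seed = MongoClient("mongodb://na-0.example.com:27017/?directConnection=true")
    seed.admin.command("replSetInitiate", {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "na-0.example.com:27017"},  # initial primary, North America
            {"_id": 1, "host": "na-1.example.com:27017"},
            {"_id": 2, "host": "na-2.example.com:27017"},
            {"_id": 3, "host": "eu-0.example.com:27017"},  # ~100 ms RTT away
            {"_id": 4, "host": "eu-1.example.com:27017"},
        ],
    })

    # w:2 is satisfied by the primary plus any one secondary -- in practice a
    # nearby NA secondary, so the EU nodes stay roughly one RTT behind.
    client = MongoClient("mongodb://na-0.example.com:27017/?replicaSet=rs0")
    events = client.test.get_collection("events", write_concern=WriteConcern(w=2))
    events.insert_one({"msg": "acknowledged by 2 of 5 nodes"})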

Proposal

I believe it is possible to pay more attention to not rolling back oplog entries, as follows (a rough sketch follows the list):

  • Execute the Raft-based election as currently planned
  • When a new primary is elected, it will first check all other reachable nodes for their oplog state.
  • If another node has a more recent opTime, connect to that node, copy the missing part of the oplog, and apply those operations on the primary.
  • Now start operating as the new primary.
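
In rough Python-flavoured pseudocode, the extra step might look like this (every helper name here is hypothetical, not an existing server function):

    # Hypothetical sketch of the proposed post-election catch-up step.
    def catch_up_then_serve(new_primary):
        # 1. The Raft election has already completed; new_primary won it.
        peers = new_primary.reachable_nodes()

        # 2. Find the peer with the most advanced optime, if any.
        freshest = max(peers, key=lambda p: p.last_optime, default=None)

        # 3. If that peer is ahead of us, copy and apply the missing oplog tail.
        if freshest is not None and freshest.last_optime > new_primary.last_optime:
            missing = freshest.fetch_oplog_since(new_primary.last_optime)
            new_primary.apply_ops(missing)

        # 4. Only now start operating as the new primary.
        new_primary.begin_accepting_writes()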

Benefits of my proposal:

  • Doesn't interfere with the election protocol: the primary is elected per Raft, and this fix is applied as an additional step afterwards.
  • Ensures that operations existing on any one available node will not be rolled back, but will instead be applied on the new primary.

Drawbacks of my proposal:

  • Will make failover time longer. Potentially this increase in failover time is even unbounded, but it would be possible to create an upper bound for it, for example by continuing the current rule that nodes more than 10 seconds behind are considered un-electable. (Such nodes must then be considered failed from a Raft point of view: they cannot participate in elections and therefore not in majority acknowledgements either.)
  • The longer failover also affects applications that use w:majority and wouldn't care about losing transactions that exist on only one node and were never majority acknowledged. Hence users who want to minimize failover time must be able to turn this functionality off; see the sketch below. (Possibly it could even be turned on/off automatically by the primary detecting which write concerns clients are using?)
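
One way such an on/off switch and upper bound might look, purely as a sketch (both names and all helpers are hypothetical):

    import time

    # Purely hypothetical knob and bound for the catch-up step sketched above;
    # neither name exists in the server.
    CATCH_UP_BEFORE_ACCEPTING_WRITES = True   # favour durability of w < majority writes
    MAX_CATCH_UP_SECONDS = 10                 # cap the extra failover time

    def maybe_catch_up(new_primary):
        if not CATCH_UP_BEFORE_ACCEPTING_WRITES:
            return  # favour fast failover; non-majority-acknowledged writes may roll back
        deadline = time.monotonic() + MAX_CATCH_UP_SECONDS
        while time.monotonic() < deadline and new_primary.is_behind_a_reachable_peer():
            new_primary.copy_and_apply_missing_oplog_entries()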

Proposed priority

3.1 Desired or lower. The justification for this is that, per our documentation, we don't promise that non-majority-committed data will never be rolled back.



 Comments   
Comment by Eric Milkie [ 06/Sep/16 ]

This idea was implemented as part of the work for SERVER-23663

Comment by Henrik Ingo (Inactive) [ 15/Apr/16 ]

A newly elected node should never roll anything back

I like this thinking. Makes sense.

And yes, this is of course very unlikely to happen, so I'm not suggesting we optimize for it, just that it should be handled correctly one way or another.

Comment by Henrik Ingo (Inactive) [ 14/Apr/16 ]

I just re-read the problem description in this ticket, and I think the example is a little misleading. For one thing, if the two North American nodes are truly 100ms ahead of the two European nodes, they won't vote for either of the European nodes, denying them the majority required to win the election. The real difference between the old and new protocols is what happens when exactly one of the surviving nodes has a write that none of the other nodes have, and this condition survives until some node's election timeout expires.

I wasn't explicit about it, but I was thinking of a failure where 2 American nodes fail (e.g. are partitioned off) and the surviving majority partition consists of 1 American node, which has all the recent transactions, and 2 European nodes, which are 100 ms behind but could vote for each other to elect a new primary, forcing the American node to roll back transactions.

Scratch that. Now that I'm re-reading and refreshing my memory, what you say is correct.

The proposal is still interesting, as it could provide a useful knob for a user to use to balance the likelihood of losing w:2 writes in 5-node sets against the amount of time that must pass before the replica set becomes available for writes.

The proposal surely addresses a corner case (at best... I must emphasize I didn't actually test any of this). For better or worse, I live with a mind that easily spots corner cases.

Comment by Andy Schwerin [ 12/Apr/16 ]

I just re-read the problem description in this ticket, and I think the example is a little misleading. For one thing, if the two North American nodes are truly 100ms ahead of the two European nodes, they won't vote for either of the European nodes, denying them the majority required to win the election. The real difference between the old and new protocols is what happens when exactly one of the surviving nodes has a write that none of the other nodes have, and this condition survives until some node's election timeout expires.

In the original election protocol (selected via protocolVersion: 0 in the replica set configuration document in MongoDB 3.2 and later), a node participating in an election may veto a candidate if it believes that it or some third node is up and has an operation in its oplog that the candidate does not have. In the new election protocol (protocolVersion: 1), a node can only refrain from voting for a candidate, and then only if it believes that it has a newer operation in its own oplog than the candidate. However, during the time between the old primary crashing and the candidate standing for election, nodes continue to fetch operations from each other's oplogs, improving the likelihood that the newest operations end up in the majority of nodes' oplogs. I suspect that in practice this lowers the odds of losing the write that is initially present in only one secondary when the original primary first crashes.
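
To make the distinction concrete, the two rules can be modelled roughly like this (an illustrative toy model, not server code):

    # Toy model of the voting difference described above; names are illustrative.
    def pv0_voter_vetoes(candidate, known_nodes):
        # protocolVersion 0: veto if any node believed to be up (the voter itself
        # or a third node, both included in known_nodes) has an operation the
        # candidate lacks.
        return any(n.up and n.last_optime > candidate.last_optime for n in known_nodes)

    def pv1_voter_grants_vote(voter, candidate):
        # protocolVersion 1: a node only withholds its own vote, and only when
        # its own oplog is newer than the candidate's.
        return voter.last_optime <= candidate.last_optime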

The proposal is still interesting, as it could provide a useful knob for a user to use to balance the likelihood of losing w:2 writes in 5-node sets against the amount of time that must pass before the replica set becomes available for writes.
