Major - P3
(copied to CRM)
The current election protocol gives preference to a node who has the most recent opTime in his oplog. Essentially out of a pool of "electable" nodes, the one with the most advanced opTime wins the election. The rationale for this is to minimize the amount of data rolled back due to a failover.
The Raft protocol does not have such a component in the election protocol. We do not want to re-introduce it because then we could no longer rely on the robustness/correctness of the Raft protocol in our implementation. In fact it seems this part of the current election protocol has been a source of bugs (e.g. no master elected, etc...). I agree with this design.
The result is that in MongoDB 3.2, users will be much more likely to see data rollbacked that was written with write concern < majority. From a strict point of view this is ok, as we never have guaranteed that such writes would be safe during failover. From a practical point of view users can today run with w:2 or even w:1 and expect to loose a minimal amount of data on failovers, and in MongoDB 3.2 they could suddenly lose hundreds of milliseconds of transactions and this is arguably a regression in our capability.
Example test case
(Note, this is based on my understanding of our current election protocol, I didn't actually verify this so far. If we want to address this problem at all, first step would be to run this test. If this test doesn't trigger the behavior I'm describing for 3.2, it will nevertheless be possible to construct a more contrived test case that will.)
write concern = 2
- Primary in North America
- 2 secondaries in North America
- 2 secondaries in Europe
- All nodes have equal/default priority
- Network RTT Europe-NA is roughly 100 ms
In MongoDB 2.4-3.0, a primary failure is likely to cause one of the other NA secondaries to become primary, because they will of course be 100 ms ahead their EU peers in replication.
In proposed MongoDB 3.2 design, all secondaries have equal probability to become primary. Therefore there is 50% chance that a EU node becomes primary and therefore the US secondaries would roll back 100 ms of their oplog.
I believe it is possible to give more attention to not rolling back oplog as follows:
- Execute the Raft-based election as currently planned
- When a new primary is elected, he will first check all other reachable nodes for their oplog state.
- If another node has a more recent opTime, connect to that and copy the missing part of the oplog and apply those on the primary.
- Now start operating as the new primary.
Benefits of my proposal:
- Doesn't mess with election protocol. Primary is elected as per Raft, then this fix is applied as additional step.
- Ensures that operations that existing on any one available node will not be rolled back, rather will be applied on the primary
Drawbacks of my proposal:
- Will make the failover time longer. Potentially this increase in failover time is unbounded too. But it would be possible to create some upper bound for this, for example by continuing the current rule that nodes that are more than 10 seconds behind are considered un-electable. (Such nodes must then be considered failed from a Raft point of view: they cannot participate in elections and therefore not in majority acknowledgements either.)
- This is also true in cases where an application has been using w:majority and wouldn't care about losing transactions that exist on one node but weren't majority acknowledged. Hence users who want to minimize failover time must be able to turn this functionality off. (Possibly this could be turned on/off automatically by the primary detecting which write concerns are used by clients?)
3.1 Desired or less. The justification for this is that per documentation we don't promise that rollback wouldn't happen to non-majority committed data.