Loading...

XML

Word

Printable

JSON

Type: Improvement
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels:
None

Assigned Teams:

Replication
Case:
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Background

The current election protocol gives preference to a node who has the most recent opTime in his oplog. Essentially out of a pool of "electable" nodes, the one with the most advanced opTime wins the election. The rationale for this is to minimize the amount of data rolled back due to a failover.

The Raft protocol does not have such a component in the election protocol. We do not want to re-introduce it because then we could no longer rely on the robustness/correctness of the Raft protocol in our implementation. In fact it seems this part of the current election protocol has been a source of bugs (e.g. no master elected, etc...). I agree with this design.

The result is that in MongoDB 3.2, users will be much more likely to see data rollbacked that was written with write concern < majority. From a strict point of view this is ok, as we never have guaranteed that such writes would be safe during failover. From a practical point of view users can today run with w:2 or even w:1 and expect to loose a minimal amount of data on failovers, and in MongoDB 3.2 they could suddenly lose hundreds of milliseconds of transactions and this is arguably a regression in our capability.

Example test case

(Note, this is based on my understanding of our current election protocol, I didn't actually verify this so far. If we want to address this problem at all, first step would be to run this test. If this test doesn't trigger the behavior I'm describing for 3.2, it will nevertheless be possible to construct a more contrived test case that will.)

write concern = 2

Primary in North America
2 secondaries in North America
2 secondaries in Europe
All nodes have equal/default priority
Network RTT Europe-NA is roughly 100 ms

In MongoDB 2.4-3.0, a primary failure is likely to cause one of the other NA secondaries to become primary, because they will of course be 100 ms ahead their EU peers in replication.

In proposed MongoDB 3.2 design, all secondaries have equal probability to become primary. Therefore there is 50% chance that a EU node becomes primary and therefore the US secondaries would roll back 100 ms of their oplog.

Proposal

I believe it is possible to give more attention to not rolling back oplog as follows:

Execute the Raft-based election as currently planned
When a new primary is elected, he will first check all other reachable nodes for their oplog state.
If another node has a more recent opTime, connect to that and copy the missing part of the oplog and apply those on the primary.
Now start operating as the new primary.

Benefits of my proposal:

Doesn't mess with election protocol. Primary is elected as per Raft, then this fix is applied as additional step.
Ensures that operations that existing on any one available node will not be rolled back, rather will be applied on the primary

Drawbacks of my proposal:

Will make the failover time longer. Potentially this increase in failover time is unbounded too. But it would be possible to create some upper bound for this, for example by continuing the current rule that nodes that are more than 10 seconds behind are considered un-electable. (Such nodes must then be considered failed from a Raft point of view: they cannot participate in elections and therefore not in majority acknowledgements either.)
This is also true in cases where an application has been using w:majority and wouldn't care about losing transactions that exist on one node but weren't majority acknowledged. Hence users who want to minimize failover time must be able to turn this functionality off. (Possibly this could be turned on/off automatically by the primary detecting which write concerns are used by clients?)

Proposed priority

3.1 Desired or less. The justification for this is that per documentation we don't promise that rollback wouldn't happen to non-majority committed data.

duplicates

SERVER-23663 New primary syncs from chosen node to catch up with timeout

Closed

is duplicated by

SERVER-22502 Replication Protocol 1 rollbacks are more likely during priority takeover

Closed

related to

SERVER-14539 Full consensus arbiter (i.e. uses an oplog)

Backlog

SERVER-11086 Election handoff to new primary, during stepdown

Closed

Assignee:: [DO NOT USE] Backlog - Replication Team
Reporter:: Henrik Ingo (Inactive)
Participants:: [DO NOT USE] Backlog - Replication Team, Andy Schwerin, Eric Milkie, Henrik Ingo
Votes:: 1 Vote for this issue
Watchers:: 35 Start watching this issue

Created:: May 13 2015 11:15:38 AM UTC
Updated:: Dec 06 2022 04:51:41 AM UTC
Resolved:: Sep 06 2016 08:24:16 PM UTC

Details

Description

Attachments

Issue Links

Activity

People

Dates