[SERVER-31274] With chaining enabled, a healthy primary may lose an election to a node that does not see it Created: 26/Sep/17 Updated: 27/Oct/23 Resolved: 02/Oct/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.0, 3.4.0, 3.5.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kelsey Schubert | Assignee: | Backlog - Replication Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Assigned Teams: |
Replication
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Consider the following example with PV1, A three node replica set with chaining enabled where A is primary and C has B as it's sync source: A <- B <- C If B does not see heartbeats from A, it will call for an election. C will vote for B despite having a healthy connection to A. As a result, a failover will occur despite a majority of nodes being able to see the primary.
|
| Comments |
| Comment by Henrik Ingo (Inactive) [ 06/Oct/17 ] |
|
That's actually exactly what I meant with my comment. As you point out, distributed systems will always have races like this, and trying to avoid them might often lead in the wrong direction (longer failover time / worst of both worlds). However, it may be possible to adjust code and/or parameters so that one outcome is likely to happen most of the time. |
| Comment by Spencer Brody (Inactive) [ 05/Oct/17 ] |
|
While researching the code to prepare my previous comment I noticed a potential optimization that could help C change sync sources sooner and narrow the window in which the unnecessary failover will occur. I filed |
| Comment by Spencer Brody (Inactive) [ 05/Oct/17 ] |
|
Yep, I think that's a good description. The thing is, we already have code to cause C to start syncing from A. When a node is caught up to its current sync source, and its current sync source doesn't have a sync source, we change sync sources: https://github.com/mongodb/mongo/blob/cb9f7cdcb7eb6ad9077ac8af3a4c0d7275c7e34f/src/mongo/db/repl/topology_coordinator_impl.cpp#L3015. The problem is that that check fundamentally races with the election timeout on B, so if B's election timeout fires before C decides to switch sync sources, then the failover will happen. B is going to call for an election no matter what once it determines A to be down for longer than its election timeout, so to avoid the failover that election needs to fail because C is already replicating from A and is now ahead of B. For that to happen, B needs to notice that it can no longer sync from A, C needs to catch up to B (and in the process learn that B has no sync source), C needs to go through the sync source selection process and choose A as it's new source, and C needs to fetch and apply a new op from A. So the chances of B calling for an election before all that happens is decently high, and the only way to prevent that is to make B take longer to win its election, which will slow down all failovers. |
| Comment by Henrik Ingo (Inactive) [ 05/Oct/17 ] |
|
Just an additional comment to elaborate on the trade off in play, to see if I follow your discussion correctly? Since C is syncing from B, if B cannot sync from A, then both secondaries are in fact disconnected from A. For example, w:majority writes will not succeed at this point. Possible ways to fix this are:
I think you're on the right path when you say that the goal is to minimize failover time. Whatever can be done to get ops succeeding again, as quickly as possible, is the right solution. For the first alternative, there will be an additional cost when clients have to discover and reconnect to the newly elected primary B. As such, it could actually be beneficial to go the second path, C starts syncing from A. To know the right answer, we would have to do actual measurements. (And in any case, you are in a much better position than myself to reason about what the code is doing here.) It's quite likely though, that "whoever happens to react first" is in fact the best, or at least a good choice. As such, changes to the current behavior would at best be low priority. |
| Comment by Spencer Brody (Inactive) [ 02/Oct/17 ] |
|
There's no way to improve this without slowing down all failovers |