[SERVER-31274] With chaining enabled, a healthy primary may lose an election to a node that does not see it Created: 26/Sep/17  Updated: 27/Oct/23  Resolved: 02/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0, 3.4.0, 3.5.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kelsey Schubert Assignee: Backlog - Replication Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Heartbeats.png    
Issue Links:
Related
related to SERVER-27125 Arbiters in pv1 should vote no in ele... Closed
related to SERVER-31429 Use the last fetched instead of the l... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

Consider the following example with PV1,

A three-node replica set with chaining enabled, where A is the primary and C has B as its sync source: A <- B <- C

If B does not see heartbeats from A, it will call for an election. C will vote for B despite having a healthy connection to A. As a result, a failover will occur despite a majority of nodes being able to see the primary.
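
To make the voting behavior concrete, here is a minimal illustrative sketch (not the actual server code; all names are made up) of a pv1 voter's decision. The vote is granted based on the candidate's term and oplog freshness alone; the voter's own healthy connection to the current primary is never consulted, which is why C votes for B:

#include <cstdint>

// Illustrative stand-ins only; these are not the server's real types.
struct OpTime {
    std::int64_t ts;
    bool operator<(const OpTime& other) const { return ts < other.ts; }
};

struct VoteRequest {
    std::int64_t term;        // candidate's (B's) proposed term
    OpTime lastAppliedOpTime; // candidate's latest applied oplog entry
};

// Sketch of the voter's (C's) decision: grant the vote if the candidate's
// term and oplog position are at least as fresh as our own. Nothing here
// asks "can I still see the current primary?", so C says yes.
bool grantVote(const VoteRequest& req, std::int64_t myTerm, OpTime myLastApplied) {
    if (req.term < myTerm) return false;                     // stale candidate
    if (req.lastAppliedOpTime < myLastApplied) return false; // candidate is behind us
    return true;
}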



 Comments   
Comment by Henrik Ingo (Inactive) [ 06/Oct/17 ]

That's actually exactly what I meant with my comment. As you point out, distributed systems will always have races like this, and trying to avoid them can often lead in the wrong direction (longer failover times / the worst of both worlds). However, it may be possible to adjust code and/or parameters so that one outcome is likely to happen most of the time.

Comment by Spencer Brody (Inactive) [ 05/Oct/17 ]

While researching the code to prepare my previous comment I noticed a potential optimization that could help C change sync sources sooner and narrow the window in which the unnecessary failover will occur. I filed SERVER-31429 for that optimization. It won't eliminate the issue entirely, but might make it harder to hit.

Comment by Spencer Brody (Inactive) [ 05/Oct/17 ]

Yep, I think that's a good description. The thing is, we already have code to cause C to start syncing from A. When a node is caught up to its current sync source, and its current sync source doesn't have a sync source, we change sync sources: https://github.com/mongodb/mongo/blob/cb9f7cdcb7eb6ad9077ac8af3a4c0d7275c7e34f/src/mongo/db/repl/topology_coordinator_impl.cpp#L3015.
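
As a rough illustration of that check (a simplified sketch with made-up names, not the real shouldChangeSyncSource logic, which weighs several other conditions):

#include <optional>
#include <string>

// Illustrative stand-ins; not the server's real types.
struct SyncSourceHeartbeat {
    long long lastAppliedTs = 0;            // how far the source has applied
    std::optional<std::string> syncSource;  // what the source reports it syncs from
};

// If we have caught up to our current sync source and that source has no sync
// source of its own (e.g. B has lost A), re-evaluate and pick a new source,
// which in this scenario would lead C back to A.
bool shouldChangeSyncSource(long long myLastAppliedTs,
                            const SyncSourceHeartbeat& currentSource) {
    const bool caughtUp = myLastAppliedTs >= currentSource.lastAppliedTs;
    const bool sourceIsStranded = !currentSource.syncSource.has_value();
    return caughtUp && sourceIsStranded;
}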

The problem is that that check fundamentally races with the election timeout on B, so if B's election timeout fires before C decides to switch sync sources, then the failover will happen. B is going to call for an election no matter what once it determines A to be down for longer than its election timeout, so to avoid the failover that election needs to fail because C is already replicating from A and is now ahead of B. For that to happen, B needs to notice that it can no longer sync from A, C needs to catch up to B (and in the process learn that B has no sync source), C needs to go through the sync source selection process and choose A as its new source, and C needs to fetch and apply a new op from A. So the chances of B calling for an election before all that happens are decently high, and the only way to prevent that is to make B take longer to win its election, which will slow down all failovers.
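
A back-of-the-envelope way to see the race (all names and inputs below are hypothetical, not server parameters): the failover is avoided only if C's whole chain of steps completes before B's election timeout fires.

#include <cstdint>

// Hypothetical model of the race; none of these are real server parameters.
struct RaceTimings {
    std::int64_t electionTimeoutMillis;   // how long B waits after losing A before standing
    std::int64_t noticeSourceLostMillis;  // C catches up to B and sees B has no sync source
    std::int64_t reselectSourceMillis;    // C runs sync source selection and picks A
    std::int64_t fetchAndApplyMillis;     // C fetches and applies a newer op from A
};

// The unnecessary failover is avoided only when C finishes all of its steps
// before B's timeout expires; raising the timeout makes this more likely but
// slows down every legitimate failover as well.
bool failoverAvoided(const RaceTimings& t) {
    const std::int64_t cTotal = t.noticeSourceLostMillis +
                                t.reselectSourceMillis +
                                t.fetchAndApplyMillis;
    return cTotal < t.electionTimeoutMillis;
}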

Comment by Henrik Ingo (Inactive) [ 05/Oct/17 ]

Just an additional comment to elaborate on the trade-off in play, and to check whether I follow your discussion correctly.

Since C is syncing from B, if B cannot sync from A, then neither secondary is receiving new writes from A. For example, w:majority writes will not succeed at this point. Possible ways to fix this are:

  • New election as in the ticket description
  • C starts syncing from A
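
To make the w:majority point above concrete, a rough sketch of the acknowledgement arithmetic (illustrative names only):

#include <cstddef>

// In a 3-member set the majority is 2. While neither B nor C is replicating
// from A, only A itself has any new write, so majority acknowledgement cannot
// be reached and w:majority writes wait.
std::size_t majority(std::size_t members) { return members / 2 + 1; }

bool majorityAcknowledged(std::size_t membersWithWrite, std::size_t members) {
    return membersWithWrite >= majority(members);
}

// Example: majorityAcknowledged(1, 3) == false until replication resumes or a
// new primary is elected.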

I think you're on the right path when you say that the goal is to minimize failover time. Whatever can be done to get ops succeeding again, as quickly as possible, is the right solution.

For the first alternative, there will be an additional cost when clients have to discover and reconnect to the newly elected primary B. As such, it could actually be beneficial to take the second path: C starts syncing from A. To know the right answer, we would have to do actual measurements. (And in any case, you are in a much better position than I am to reason about what the code is doing here.)

It's quite likely, though, that "whoever happens to react first" is in fact the best choice, or at least a good one. As such, changes to the current behavior would at best be low priority.

Comment by Spencer Brody (Inactive) [ 02/Oct/17 ]

There's no way to improve this without slowing down all failovers.
