[SERVER-43904] When stepping down, step up doesn't filter out frozen nodes Created: 09/Oct/19 Updated: 29/Oct/23 Resolved: 13/Oct/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.14 |
| Fix Version/s: | 4.9.0, 4.0.23, 4.4.4, 4.2.13 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Bartley | Assignee: | Xuerui Fa |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2, v4.0, v3.6 |
| Sprint: | Repl 2020-10-19 |
| Participants: | |
| Description |
One of the recommended ways [0] to force a particular node to become primary is to freeze all non-candidate nodes and then call replSetStepDown on the primary. As of MongoDB 3.6, the stepdown code attempts to step up a candidate (by calling replSetStepUp). However, that code doesn't exclude frozen nodes, and attempting to step up a frozen node simply fails:

"2019-10-09T00:24:05.517+0000 I REPL [conn352334] Not starting an election for a replSetStepUp request, since we are not electable due to: Not standing for election because I am still waiting for stepdown period to end at 2019-10-09T00:33:59.473+0000 (mask 0x20)"

This isn't particularly bad, since the unfrozen node will eventually call for, and win, an election, but it does make failovers slower (up to electionTimeoutMillis slower, presumably).

An alternative approach that we're using, which isn't explicitly documented, is to increase the priority of both the current primary and the candidate node, and then run replSetStepDown. I've verified both in the code and in our logs that this consistently gets MongoDB to step up the candidate node. It might be nice to document this approach, since I think it improves on both approaches currently mentioned: increasing the priority of just the candidate works, but tends to be slower since the "priority takeover" mechanism takes a few seconds to trigger, and it provides less control than an explicit replSetStepDown.

[0] https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/
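For illustration, here is a minimal mongo-shell sketch of the alternative approach described above (raising the priority of both the current primary and the candidate before stepping down). The member indices, priority values, and stepdown timeout are placeholders for this example, not values taken from the ticket:

```
// Run while connected to the current primary.
// Raise the priority of both the current primary and the intended candidate,
// so that after stepdown the candidate is the clearly preferred node.
cfg = rs.conf();
cfg.members[0].priority = 2;   // current primary (assumed index)
cfg.members[1].priority = 2;   // intended candidate (assumed index)
rs.reconfig(cfg);

// Step down the current primary. As of 3.6 it attempts to step up an
// electable, caught-up secondary -- here, the high-priority candidate.
rs.stepDown(60);   // seconds the old primary remains ineligible
```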
| Comments |
| Comment by Githook User [ 14/Jan/21 ] |
Author: XueruiFa <xuerui.fa@mongodb.com>
Message: (cherry picked from commit bf614cb57059c74830633855e28b3f4677cd4f8d)
| Comment by Githook User [ 14/Jan/21 ] |
Author: XueruiFa <xuerui.fa@mongodb.com>
Message: (cherry picked from commit bf614cb57059c74830633855e28b3f4677cd4f8d)
| Comment by Githook User [ 04/Jan/21 ] |
Author: XueruiFa <xuerui.fa@mongodb.com>
Message: (cherry picked from commit bf614cb57059c74830633855e28b3f4677cd4f8d)
| Comment by Githook User [ 13/Oct/20 ] |
Author: XueruiFa <xuerui.fa@mongodb.com>
Message:
| Comment by Siyuan Zhou [ 31/Mar/20 ] |
Thanks ying@stripe.com for the feedback. We'll keep this ticket to track the stepdown command's behavior with priorities, and we'll keep the general priority takeover issue on our radar.
| Comment by Ying Xu [ 31/Mar/20 ] |
We are less concerned about the general priority takeover behavior because we don't use it. We tried it, but it performed worse than our current approach, which surprised us; we don't even know whether that is a bug or expected behavior. If you think it is a bug, we can file a ticket.
| Comment by Siyuan Zhou [ 31/Mar/20 ] |
I understand. My question is whether you are more concerned about failovers caused by planned maintenance or by node failures. I understand you ran into this issue after a reconfig, which seems to be a workaround for the undesired behavior caused by planned maintenance, so we are focusing on planned maintenance in this ticket. If you are also concerned about the general priority takeover behavior, could you please file a new ticket?
| Comment by Ying Xu [ 30/Mar/20 ] |
The use case for us is to fail over the primary to a dedicated server, or to a set of servers, with minimal write unavailability. If that is not possible at a given time, we still want minimal write unavailability even if a different server is chosen as the primary.
| Comment by Siyuan Zhou [ 29/Mar/20 ] |
I agree it is possible for a high-priority node to fail to take over. However, since this ticket is about primary stepdown, the solution could be different. As evin.roesle mentioned before, the best way to get the fastest elections is to allow the system to decide which nodes should be elected on stepdown. One solution could be to attempt to choose the highest-priority node as the candidate on stepdown. That should be sufficient for planned maintenance using the replSetStepDown command. Is that the major use case for you? Changing the behavior of priority takeover on general failover is a separate issue and worth a new ticket. If that is also your concern, could you please file a new ticket?
| Comment by David Bartley [ 28/Mar/20 ] |
And to be clear, while the original bug was indeed about replSetStepDown sometimes resulting in relatively long periods without a primary, the suggested fix was to use "priority takeovers"; as noted above, that procedure also sometimes results in relatively long periods without a primary. In short, both of the suggested ways to perform a failover sometimes result in relatively long periods without a primary.
| Comment by David Bartley [ 28/Mar/20 ] |
In this case there was no explicit "replSetStepDown"; instead, we just increased the priority on the intended primary and waited for it to become primary. I guess it's not clear to me why this particular scenario shouldn't be expected? Specifically, the scenario described in the timeline above seems like a pretty common one: the replset is left without a primary, and some new node (node C above) will eventually run for, and win, a future election, but this increases the period during which there is no primary.

In contrast, with an explicit replSetStepDown the old primary steps down and then chooses an eligible secondary (i.e. one that is caught up and capable of being primary) to explicitly step up. Since 1) writes will be blocked at the time of candidate selection, and 2) the candidate must be caught up, there's no possibility that this candidate will fail to be elected because it's lagged (but, as mentioned originally, it can fail if it's frozen).
| Comment by Siyuan Zhou [ 28/Mar/20 ] |
Thanks for the log. As mentioned before, we also need the timeline of the stepdown command and of any node stepping up after the stepdown but before node A ran the priority takeover. A priority takeover can only happen when the node knows of an existing primary, so there should have been another primary before the priority takeover. This ticket is about stepdown, so any solution will have to take that into consideration; that's why it's key to understand the issue.
| Comment by Ying Xu [ 27/Mar/20 ] |
Yes, we followed "Force a Member to be Primary by Setting its Priority High". The following is the timeline of what happened. Node A is the node whose priority we increased, Node B is the old primary, and Node C is the new primary.
Node A:
Node B:
Node A:
Node C:
| Comment by Siyuan Zhou [ 27/Mar/20 ] |
ying@stripe.com, thanks for your feedback. Before exploring any solution, I'd like to confirm your observation. By "the documented approach", I assume you're referring to "Force a Member to be Primary by Setting its Priority High" on this page. Could you please clarify the timeline of the elections, especially when the stepdown ran? Did a node with lower priority step up immediately after the stepdown? Did the highest-priority node start the election afterwards? The timestamps of these events will help us verify the behavior.
| Comment by Ying Xu [ 26/Mar/20 ] |
We tried the documented approach but got worse write availability. The problem is that the candidate primary (with the highest priority) may succeed in the dry-run election and start a real election. Once the real election starts, the old primary steps down, but the candidate can then fail to get enough votes in the real election. So there is no primary for an extended period of time, until the election timeout is reached and another node starts a new election. Can the candidate start a new election immediately if it fails to get enough votes?
| Comment by David Bartley [ 07/Jan/20 ] |
It's possible it's just a longer wait time, and not actually a longer write unavailability time.
| Comment by Evin Roesle [ 14/Nov/19 ] |
Hi bartle, I'm Evin, a product manager on Aly's team.

Often the best way to get the fastest elections is to allow the system to decide which nodes should be elected. This way we can avoid a large chunk of time in Primary Catchup. Primary Catchup is the phase of the election in which the elected node makes a best-effort attempt to replicate all current data before accepting new writes. This is why I'm reluctant to add an option to stepdown that allows you to specify a candidate; I think this would lead to longer elections due to the lack of insight into staleness.

When you say that priority takeovers take longer when just specifying a higher priority on the candidate, this does not mean longer failover times with write unavailability, but instead a longer wait time for the election to occur. Is that painful for you? The election time itself should be the same or even faster for priority takeovers, because the node with the higher priority waits until it is current before running, meaning the Primary Catchup phase does not take up any time, minimizing write unavailability.

Evin
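For comparison, a minimal mongo-shell sketch of the priority takeover approach described here, where only the candidate's priority is raised and the system decides when to run the election; the member index and priority value are placeholder assumptions, not values from this ticket:

```
// Connected to the current primary: raise only the intended candidate's
// priority. Once that node is caught up, it calls a priority takeover
// election on its own; no explicit replSetStepDown is issued.
cfg = rs.conf();
cfg.members[2].priority = 2;   // intended candidate (assumed index)
rs.reconfig(cfg);

// The takeover can then be observed with rs.status() on any member.
```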
| Comment by David Bartley [ 20/Oct/19 ] |
I'm trying to minimize failover duration, which (I assume) is what the "step up" feature was intended to do. However, neither of the documented approaches to electing a particular node would use that "step up" feature. As stated above, it does not reliably work with replSetFreeze. The other method (increasing just the candidate primary's priority) does not involve an explicit replSetStepDown, so obviously the "step up" feature doesn't come into play. In practice, both methods cause failovers that are a few seconds longer than the third approach I listed earlier (increase the priority on both the current and candidate primary before running replSetStepDown).

If there's no plan to fix the poor interaction between replSetFreeze and replSetStepUp, I'd suggest either removing mention of the "step up" optimization, or documenting a third approach that reliably works with the "step up" optimization.
| Comment by Siyuan Zhou [ 19/Oct/19 ] |
In that case, it sounds like you need priorities. Why didn't priority work for you?
| Comment by David Bartley [ 19/Oct/19 ] |
Yes, adding an optional argument to replSetStepDown seems like it'd simplify the existing approaches to failing over to a specific node. Would you see much utility in allowing a list of candidates? I guess one use case might be a situation where you're trying to do a cross-region failover and don't particularly care which node is elected, as long as it's in the other region.
| Comment by Siyuan Zhou [ 17/Oct/19 ] |
We removed this behavior in the Election Handoff project, and I'd hope to avoid the complexity of adding it back. Besides, propagating frozen state via heartbeats is racy. Since the major use case of replSetFreeze is to choose a specific node to be the next primary, an alternative solution is to add an option to the replSetStepDown command to nominate the next primary, so that stepdown will wait for that specific node instead.
| Comment by Danny Hatcher (Inactive) [ 10/Oct/19 ] |
Thanks for the report. I'll pass this on to the Replication team to determine whether there is an improvement to be made with regard to "stepping up" frozen nodes.
| Comment by David Bartley [ 09/Oct/19 ] |
Specifically, frozen nodes seem to be considered unelectable, so you'd expect "e" to be false.
| Comment by David Bartley [ 09/Oct/19 ] |
Digging a bit into the code, it seems like frozenness is supposed to be communicated via the "e" field of heartbeats (under pv1, anyway), and the step-up code also seems to consider only electable candidates. One possibility is that there's a race between nodes being frozen and the current primary discovering that, but even with a 10s sleep between those steps (i.e. 5x heartbeatIntervalMillis) I still observe (unsuccessful) step-ups to a frozen node.
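For reference, a minimal mongo-shell sketch of the reproduction described in this ticket; the freeze duration, stepdown arguments, and the 10-second sleep are placeholder values:

```
// On each non-candidate secondary: make the node ineligible for election.
db.adminCommand({replSetFreeze: 120});

// On the current primary: wait several heartbeat intervals so the primary
// should have learned via heartbeats that those nodes are frozen...
sleep(10000);

// ...then step down. Per this ticket, the primary may still pick a frozen
// node for its replSetStepUp attempt; that attempt fails, and the failover
// then waits for a normal election instead.
db.adminCommand({replSetStepDown: 60, secondaryCatchUpPeriodSecs: 15});
```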