[SERVER-35200] Speed up failure detection in the OplogFetcher during steady state replication
Created: 23/May/18 | Updated: 08/Jan/24 | Resolved: 06/Jul/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.4.18, 3.6.7, 4.0.1, 4.1.1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.0, v3.6, v3.4 |
| Sprint: | Repl 2018-07-02, Repl 2018-07-16 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
We added retry logic into the OplogFetcher in |
| Comments |
| Comment by Githook User [ 01/Oct/18 ] |
|
Author: William Schultz <william.schultz@mongodb.com> (will62794)
Message: This patch attempts to improve how quickly a secondary node in
(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6) |
| Comment by Githook User [ 24/Jul/18 ] |
|
Author: William Schultz <william.schultz@mongodb.com> (will62794)
Message: This patch attempts to improve how quickly a secondary node in
(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6) |
| Comment by Githook User [ 06/Jul/18 ] |
|
Author: William Schultz <william.schultz@mongodb.com> (will62794)
Message: This patch attempts to improve how quickly a secondary node in
(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6) |
| Comment by Githook User [ 06/Jul/18 ] |
|
Author: William Schultz <william.schultz@mongodb.com> (will62794)
Message: Revert "
This reverts commit 52c94c89cf0a8aceb87a2fd80055838d91540ff9. |
| Comment by Githook User [ 06/Jul/18 ] |
|
Author: William Schultz <william.schultz@mongodb.com> (will62794)
Message: This patch attempts to improve how quickly a secondary node in
(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6) |
| Comment by Githook User [ 06/Jul/18 ] |
|
Author: William Schultz <william.schultz@mongodb.com> (will62794)
Message: This patch attempts to improve how quickly a secondary node in |
| Comment by William Schultz (Inactive) [ 19/Jun/18 ] |
|
After looking at this a bit and reading through a few of the related help tickets, I have a few thoughts. At a high level, it appears that one goal here is to help make the "majority write failover time" as close as possible to the "primary failover time", where "majority write failover time" is the amount of time between when an old primary is able to commit majority writes and when a new primary is able to do so. Improvements to this "majority write failover time" can, I think, fall into two categories: improvements in the absence of failover and in the presence of failover. I have separated the discussion into these categories. Most of the ideas are very similar to what spencer and judah.schvimer already suggested, but I am including some additional rationale and explanation.

Absence of Failover

In the absence of a failover (i.e. there exists a stable primary in contact with a majority), I think that the OplogFetcher "retry" duration (the cumulative amount of time spent trying to connect/communicate with the sync source) should not reasonably exceed the election timeout. If we detect that a node is down via this communication channel, then we should make the decision to restart the sync source selection process, since we cannot sync from a down node. Throughout the system we should be using the election timeout as our standard criterion for determining "down-ness" of another node. If we can't communicate with some other node within an election timeout, by any means, we should use that information to determine that the node is down. As an alternative to actually changing the timeout, we could consider injecting the liveness information from heartbeats etc. into the OplogFetcher, to tell it when it should stop syncing from another node. This may keep the fetcher more decoupled from the consensus logic. For example, when a node transitions to the DOWN state, we could restart the sync source selection process if that node is our current sync source. Changing the timeouts of the fetcher would probably be simpler, though.

Presence of Failover

In the presence of failover, I think there is a separate optimization to improve our "majority write failover time", similar to what Spencer already mentioned. I think this behavioral change has different dynamics in the "chaining enabled" and "chaining disabled" cases, so I will discuss them separately.

Chaining disabled: In the case when chaining is disabled, the basic optimization is to switch sync sources immediately if you learn about a primary that was elected in a higher term than your current sync source. This is reasonable (and likely optimal) since, with chaining disallowed, we know that we must sync from this new primary (or a newer primary) in the future. So, switching sync sources immediately is a way of pre-empting this eventual sync source switch. One way to implement this would be that if we receive a heartbeat from a primary in a higher term than the current primary (our current sync source), then we immediately decide to change sync sources. Spencer mentioned remembering the original term of the primary you are syncing from in the fetcher, and only retrying if your current term is the same. That is an alternative approach, with the caveat that you may decide not to retry even if no new primary has yet been elected, e.g. in the case of a failed election in a higher term. Presumably, though, knowledge of a new term indicates that your current sync source is eventually going to step down, so you might as well cut off syncing right away. Either approach may be suitable.

Chaining enabled: With chaining enabled, I am less sure whether there is a clear optimization to make. One thought is to do something similar to the non-chaining case: if you are currently syncing from a primary and you receive a heartbeat informing you about a primary in a new term, you would immediately enter sync source selection. In the non-chaining case this decision is trivial, because you know you won't be able to sync from the now-stale primary during the new term, since it will eventually step down and become a secondary; it would make no sense to wait any longer trying to sync from it. In the chaining case, though, it is possible that the now-stale primary could be a valid sync source in the newer term, since you are allowed to sync from secondaries. The amount of time until you would be able to sync from this node as a secondary, though, is going to be at least as long as it takes for the node to step down (kill connections) and select a new sync source. In this case, it seems like you may be wasting time waiting for this node to re-establish a sync source when you could just as well be carrying on your own selection of a sync source. One way this behavior might end up being detrimental is if this old primary is actually the closest node to you in terms of latency: if you didn't wait for the node to become a potentially valid sync source again, you might overlook it as a candidate when you pre-emptively select a new source. I'm not sure that's a big issue, though. To me, the chaining case seems like less of a definitive win, and it is harder to reason about the possible unintended impacts, so maybe just lowering the network timeouts enough will be good enough in this case.

One other behavior in the chaining case that seems reasonable to maintain is that nodes far downstream don't always need to change sync sources. It might be fine for them to keep syncing from a certain secondary, even if that secondary has to go through a process of finding a new primary to sync from due to a failover. The solution of always changing sync sources when the original term of your sync source differs from your current term doesn't seem to take this scenario into account: it may force downstream secondaries to re-select sync sources even if their current sync source has changed terms but is a perfectly fine sync source to continue syncing from. This is just one other scenario that I thought of, but there may be others that are more relevant/persuasive.

Obviously, as a short-term and sub-optimal fix, lowering the network timeouts is the easiest solution. I attached a repro (secondary_failover_time.js |
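To make the chaining-disabled idea above concrete, here is a minimal, self-contained sketch of that decision rule. All of the names (HeartbeatInfo, SyncSourceState, shouldReselectSyncSource) are hypothetical and are not part of the actual OplogFetcher or heartbeat code; the sketch only illustrates "re-enter sync source selection when a heartbeat reveals a primary in a higher term than the primary we are syncing from", and deliberately does nothing in the chaining-enabled case, matching the conclusion above that lowering network timeouts may be good enough there.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>

// Hypothetical types standing in for heartbeat data and local sync source state.
struct HeartbeatInfo {
    int64_t term;    // term advertised by the sender
    bool isPrimary;  // whether the sender claims to be primary
};

struct SyncSourceState {
    std::optional<int64_t> syncSourcePrimaryTerm;  // term of the primary we sync from
    bool chainingEnabled;
};

// Returns true if the node should abandon its current sync source and re-run
// sync source selection, following the chaining-disabled reasoning above.
bool shouldReselectSyncSource(const SyncSourceState& state, const HeartbeatInfo& hb) {
    if (state.chainingEnabled) {
        // With chaining, the stale primary may remain a valid sync source as a
        // secondary, so this sketch leaves the decision to the normal timeout path.
        return false;
    }
    // Without chaining we must eventually sync from the newest primary, so a
    // heartbeat from a primary in a higher term pre-empts the inevitable switch.
    return hb.isPrimary && state.syncSourcePrimaryTerm.has_value() &&
           hb.term > *state.syncSourcePrimaryTerm;
}

int main() {
    SyncSourceState state{/*syncSourcePrimaryTerm=*/3, /*chainingEnabled=*/false};
    HeartbeatInfo hb{/*term=*/4, /*isPrimary=*/true};
    std::cout << std::boolalpha << shouldReselectSyncSource(state, hb) << std::endl;  // true
    return 0;
}
```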
| Comment by Spencer Brody (Inactive) [ 29/May/18 ] |
|
Another possibility is to remember the term when the oplog fetcher was started, and if you encounter an error, only retry if the current term is the same. |
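A minimal sketch of that suggestion, under the assumption that retry decisions can be funneled through one small policy object; the class and method names here (TermAwareRetryPolicy, shouldRetry) are hypothetical and not the real OplogFetcher interface.

```cpp
#include <cstdint>

// Hypothetical retry gate: remember the replication term when the fetcher starts
// and suppress retries once a newer term is observed, since the old sync source
// is expected to step down anyway.
class TermAwareRetryPolicy {
public:
    TermAwareRetryPolicy(int64_t startTerm, int maxRetries)
        : _startTerm(startTerm), _retriesLeft(maxRetries) {}

    // Called after a fetch against the current sync source fails; returns true
    // if the fetcher should try the same source again.
    bool shouldRetry(int64_t currentTerm) {
        if (currentTerm != _startTerm) {
            return false;  // the term moved on; go straight to sync source selection
        }
        return _retriesLeft-- > 0;
    }

private:
    int64_t _startTerm;
    int _retriesLeft;
};
```

With maxRetries set to 1, this also matches the "retry only once" idea discussed elsewhere on this ticket.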
| Comment by Judah Schvimer [ 25/May/18 ] |
|
Sync source selection/resolution is certainly more costly than retrying the find, which I expect to solve the problem in the common case of a network blip rather than a partition. Lowering the network timeout on retries makes sense, especially for the initial find command, which I believe currently has a timeout of over a minute. A single retry also seems reasonable. |
| Comment by Spencer Brody (Inactive) [ 24/May/18 ] |
|
Oh interesting, good catch. That means I misdiagnosed the issue I was working on with arnie.listhaus. He was testing how quickly a secondary recovers when it becomes partitioned from the primary. What I thought was happening was that the node had detected an error with its sync source and was looking for a new sync source, but kept retrying to connect to the same candidate sync source and failing before looking for a new one. What I think was actually happening is that the node was happily syncing from its source; then, when it got partitioned, it got an error in the oplog fetcher. The oplog fetcher retried 3 times to connect to its current source before even entering sync source selection for the first time. As soon as the node gets into sync source selection, it recovers quite quickly after selecting the remaining node as its new sync source.
So the question now is: in steady state, when a secondary gets any error talking to its current sync source, should it retry the connection a few times, or should it immediately enter sync source selection? I feel like it should probably still retry connecting before looking for a new sync source, to increase the stability of the spanning tree, but it should probably only retry once, and it should use a lower network timeout (and corresponding awaitData timeout) on the retry than it does when connecting to a new sync source for the first time. |
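As a rough illustration of that policy, here is a hedged sketch: one quick retry against the current sync source with tighter network and awaitData timeouts than a first connection to a newly selected source. The struct, function name, and all timeout values are made up for illustration and are not the server's actual parameters or defaults.

```cpp
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical timeout policy: retry the current sync source once, with tighter
// network/awaitData limits than the ones used for a brand-new sync source.
// The specific values are illustrative, not the server's actual defaults.
constexpr int kRetriesAgainstCurrentSource = 1;

struct FetchTimeouts {
    std::chrono::milliseconds networkTimeout;
    std::chrono::milliseconds awaitDataTimeout;
};

FetchTimeouts timeoutsFor(bool retryingCurrentSource) {
    if (retryingCurrentSource) {
        // Fail fast on the retry so a real partition pushes us into sync source
        // selection quickly.
        return {2000ms, 2000ms};
    }
    // First connection to a newly selected sync source: be more patient.
    return {10000ms, 5000ms};
}
```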
| Comment by Judah Schvimer [ 24/May/18 ] |
|
The SyncSourceResolver does not use the OplogFetcher. Do you mean the Fetcher, or do you mean normal steady state replication outside of the SyncSourceResolver? I'm hesitant to remove retries from the OplogFetcher because they initially prevented a lot of spurious build failures, though maybe the improvements to our testing robustness make this unnecessary. |