[SERVER-35200] Speed up failure detection in the OplogFetcher during steady state replication Created: 23/May/18  Updated: 08/Jan/24  Resolved: 06/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.18, 3.6.7, 4.0.1, 4.1.1

Type: Improvement Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File secondary_failover_time.js    
Issue Links:
Backports
Duplicate
duplicates SERVER-32703 Secondary can take a couple minutes t... Closed
is duplicated by SERVER-35835 Allow quicker sync source change when... Closed
Related
related to SERVER-25702 add support to OplogFetcher for resta... Closed
related to SERVER-32703 Secondary can take a couple minutes t... Closed
related to SERVER-35835 Allow quicker sync source change when... Closed
is related to SERVER-42910 Oplog query with higher timestamp but... Closed
is related to SERVER-35996 Create performance tests for measurin... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6, v3.4
Sprint: Repl 2018-07-02, Repl 2018-07-16
Participants:
Case:

 Description   

We added retry logic to the OplogFetcher in SERVER-25702 to increase the robustness of initial sync. By doing so, however, we also added these restarts to regular steady state replication. This can slow down sync source resolution unnecessarily when one node is unavailable but other nodes would be able to serve as the sync source.



 Comments   
Comment by Githook User [ 01/Oct/18 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-35200 Speed up steady state oplog fetching failure detection

This patch attempts to improve how quickly a secondary node in
steady state replication detects that its current sync source has failed
or become partitioned and tries to select a new sync source. The speed of
this process can significantly impact how long it takes for a new
primary to begin committing majority writes after a previous primary
has failed or become partitioned from the replica set. This patch
improves on the old behavior by reducing the number of fetcher restarts
to 1, and also reducing the oplog 'find' request timeout used on a restart.

(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6)
Branch: v3.4
https://github.com/mongodb/mongo/commit/2524d2aa24d137b6fa4e48c5f199eefa6e1cadaa
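
To make the commit above concrete, here is a minimal sketch of the kind of policy it describes: cap steady-state fetcher restarts at one and use a much shorter 'find' timeout on the restarted attempt. All names and numeric values (OplogFetcherRetryPolicy, kRetriedFindTimeout, the 2-second value, and so on) are hypothetical illustrations, not the actual server parameters or code.

```cpp
// Hypothetical sketch (not the real OplogFetcher code): cap steady-state
// restarts at one and use a shorter 'find' timeout on the restarted attempt.
#include <chrono>
#include <iostream>

struct OplogFetcherRetryPolicy {
    // Assumed values for illustration only.
    static constexpr int kSteadyStateMaxRestarts = 1;                // previously several restarts
    static constexpr std::chrono::seconds kInitialFindTimeout{60};   // first 'find' against a new sync source
    static constexpr std::chrono::seconds kRetriedFindTimeout{2};    // much shorter on a restart

    int restartsDone = 0;

    // Returns true if the fetcher should retry against the same sync source,
    // false if it should give up and re-enter sync source selection.
    bool shouldRestart() {
        if (restartsDone >= kSteadyStateMaxRestarts) {
            return false;
        }
        ++restartsDone;
        return true;
    }

    std::chrono::seconds findTimeout() const {
        return restartsDone == 0 ? kInitialFindTimeout : kRetriedFindTimeout;
    }
};

int main() {
    OplogFetcherRetryPolicy policy;
    std::cout << "initial find timeout: " << policy.findTimeout().count() << "s\n";
    while (policy.shouldRestart()) {
        std::cout << "restart with find timeout: " << policy.findTimeout().count() << "s\n";
    }
    std::cout << "give up; choose a new sync source\n";
}
```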

Comment by Githook User [ 24/Jul/18 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-35200 Speed up steady state oplog fetching failure detection

This patch attempts to improve how quickly a secondary node in
steady state replication detects that its current sync source has failed
or become partitioned and tries to select a new sync source. The speed of
this process can significantly impact how long it takes for a new
primary to begin committing majority writes after a previous primary
has failed or become partitioned from the replica set. This patch
improves on the old behavior by reducing the number of fetcher restarts
to 1, and also reducing the oplog 'find' request timeout used on a restart.

(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6)
Branch: v3.6
https://github.com/mongodb/mongo/commit/dda0b877575c4be05dcafe8d94abd1bc2a66980c

Comment by Githook User [ 06/Jul/18 ]

Author:

{'username': 'will62794', 'name': 'William Schultz', 'email': 'william.schultz@mongodb.com'}

Message: SERVER-35200 Speed up steady state oplog fetching failure detection

This patch attempts to improve how quickly a secondary node in
steady state replication detects that its current sync source has failed
or become partitioned and tries to select a new sync source. The speed of
this process can significantly impact how long it takes for a new
primary to begin committing majority writes after a previous primary
has failed or become partitioned from the replica set. This patch
improves on the old behavior by reducing the number of fetcher restarts
to 1, and also reducing the oplog 'find' request timeout used on a restart.

(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b39a4fdd7ab50f9965f0cf4a5f21242ba6f87b0a

Comment by Githook User [ 06/Jul/18 ]

Author:

{'username': 'will62794', 'name': 'William Schultz', 'email': 'william.schultz@mongodb.com'}

Message: Revert "SERVER-35200 Speed up steady state oplog fetching failure detection"

This reverts commit 52c94c89cf0a8aceb87a2fd80055838d91540ff9.
Branch: v4.0
https://github.com/mongodb/mongo/commit/bcc102af976d5f4a905e48c0f6bb856b09289c3a

Comment by Githook User [ 06/Jul/18 ]

Author:

{'username': 'will62794', 'name': 'William Schultz', 'email': 'william.schultz@mongodb.com'}

Message: SERVER-35200 Speed up steady state oplog fetching failure detection

This patch attempts to improve how quickly a secondary node in
steady state replication detects that its current sync source has failed
or become partitioned and tries to select a new sync source. The speed of
this process can significantly impact how long it takes for a new
primary to begin committing majority writes after a previous primary
has failed or become partitioned from the replica set. This patch
improves on the old behavior by reducing the number of fetcher restarts
to 1, and also reducing the oplog 'find' request timeout used on a restart.

(cherry picked from commit 2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6)
Branch: v4.0
https://github.com/mongodb/mongo/commit/52c94c89cf0a8aceb87a2fd80055838d91540ff9

Comment by Githook User [ 06/Jul/18 ]

Author:

{'username': 'will62794', 'name': 'William Schultz', 'email': 'william.schultz@mongodb.com'}

Message: SERVER-35200 Speed up steady state oplog fetching failure detection

This patch attempts to improve how quickly a secondary node in
steady state replication detects that its current sync source has failed
or become partitioned and tries to select a new sync source. The speed of
this process can significantly impact how long it takes for a new
primary to begin committing majority writes after a previous primary
has failed or become partitioned from the replica set. This patch
improves on the old behavior by reducing the number of fetcher restarts
to 1, and also reducing the oplog 'find' request timeout used on a restart.
Branch: master
https://github.com/mongodb/mongo/commit/2c6614c3bd716fb9ccaad1f7c68e9eb490ed1df6

Comment by William Schultz (Inactive) [ 19/Jun/18 ]

After looking at this a bit and reading through a few of the related help tickets, I have a few thoughts. At a high level, it appears that one goal here is to make the "majority write failover time" as close as possible to the "primary failover time", where "majority write failover time" is the amount of time between when an old primary is able to commit majority writes and when a new primary is able to do so. Improvements to this "majority write failover time" can, I think, fall into two categories: improving in the absence of failover and in the presence of failover. I have separated the discussion of the improvements into these categories. Most are very similar to what spencer and judah.schvimer already suggested, but I am including some additional rationale and explanation.

Absence of Failover

In the absence of a failover (i.e. there exists a stable primary in contact with a majority), I think that the OplogFetcher "retry" duration (cumulative amount of time spent trying to connect/communicate with sync source) should not reasonably exceed the election timeout. If we detect that a node is down via this communication channel, then we should make the decision to restart the sync source selection process, since we cannot sync from a down node. Throughout the system we should be using the election timeout as our standard criterion for determining "down-ness" of another node. If we can't communicate to some other node within an election timeout, by any means, we should use that information to determine a node is down. As an alternative to actually changing the timeout, we could consider injecting the liveness information from heartbeats etc. into the OplogFetcher, to tell it when it should stop syncing from another node. This may keep the fetcher more decoupled from the consensus logic. For example, when a node transitions to DOWN state, we could restart the sync source selection process if this node is our current sync source. Changing the timeouts of the fetcher would probably be simpler, though.
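
A rough sketch of the idea above, bounding the cumulative retry time by the election timeout and, as the alternative mentioned, letting heartbeat liveness information short-circuit the decision. The structure and names here are assumptions for illustration, not existing server code.

```cpp
// Illustrative only: decide whether to keep retrying the current sync source
// or re-enter sync source selection, bounding the cumulative retry time by
// the election timeout.
#include <chrono>
#include <iostream>

struct SteadyStateFetchState {
    std::chrono::milliseconds electionTimeout{10000};   // replica set electionTimeoutMillis
    std::chrono::milliseconds timeSpentRetrying{0};     // cumulative time spent on failed fetch attempts
    bool syncSourceMarkedDownByHeartbeats = false;      // liveness info injected from heartbeats
};

// Returns true if the node should abandon its current sync source and
// restart sync source selection.
bool shouldChooseNewSyncSource(const SteadyStateFetchState& s) {
    // Option 1: never spend longer retrying than we would wait to declare
    // the node down for election purposes.
    if (s.timeSpentRetrying >= s.electionTimeout) {
        return true;
    }
    // Option 2 (the alternative from the comment): let heartbeat liveness
    // information short-circuit the decision instead of tuning timeouts.
    if (s.syncSourceMarkedDownByHeartbeats) {
        return true;
    }
    return false;
}

int main() {
    SteadyStateFetchState s;
    s.timeSpentRetrying = std::chrono::milliseconds{12000};
    std::cout << std::boolalpha << shouldChooseNewSyncSource(s) << "\n";  // true
}
```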

Presence of Failover

In the presence of failover, I think there is a separate optimization to improve our "majority failover time", similar to what Spencer already mentioned. I think that this behavioral change has different dynamics in the "chaining enabled" and "chaining disabled" cases, so I will discuss them separately.

Chaining disabled: In the case when chaining is disabled, the basic optimization is to switch sync sources immediately if you learn about a primary that was elected in a higher term than your current sync source. This is reasonable (and likely optimal) since, with chaining disallowed, we know that we must sync from this new primary (or a newer primary) in the future. So, switching sync sources immediately is a way of pre-empting this eventual sync source switch. I imagine one way to implement this would be that if we receive a heartbeat from a primary in a higher term than the current primary (our current sync source), then we should immediately decide to change sync sources. Spencer mentioned remembering the original term of the primary you are syncing from in the fetcher, and only retrying if your current term is the same. That is an alternative approach, with the caveat that you may decide to not retry even if no new primary has yet been elected e.g. in the case of a failed election in a higher term. Presumably, though, knowledge of a new term indicates that eventually your current sync source is going to step down, so you might as well cut off syncing right away. Either approach may be suitable.
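
A sketch of the chaining-disabled decision described above, assuming hypothetical names; a real implementation would hook into heartbeat processing rather than a free function.

```cpp
// Illustrative sketch of the chaining-disabled optimization: if a heartbeat
// reveals a primary in a higher term than the primary we are syncing from,
// switch sync sources immediately.
#include <cstdint>
#include <iostream>

struct HeartbeatInfo {
    bool senderIsPrimary = false;
    std::int64_t senderTerm = -1;
};

// 'syncSourceTerm' is the term in which our current sync source (a primary,
// since chaining is disabled) was elected.
bool shouldSwitchSyncSourceOnHeartbeat(const HeartbeatInfo& hb, std::int64_t syncSourceTerm) {
    // A primary in a higher term means our current sync source must
    // eventually step down, so stop syncing from it right away.
    return hb.senderIsPrimary && hb.senderTerm > syncSourceTerm;
}

int main() {
    // We are syncing from a primary elected in term 5; a heartbeat arrives
    // from a primary in term 6, so we switch immediately.
    HeartbeatInfo hb{true, 6};
    std::cout << std::boolalpha << shouldSwitchSyncSourceOnHeartbeat(hb, 5) << "\n";  // true
}
```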

Chaining enabled: With chaining enabled, I am less sure if there is a clear optimization to make. One thought is to do something similar to the non-chaining case. If you are currently syncing from a primary, and you receive a heartbeat informing you about a primary in a new term, you would decide to immediately enter sync source selection. In the non-chaining case, it should be trivial to make this decision because you know that you wouldn't be able to sync from the now stale primary during the new term, since it would eventually step down and become a secondary. So, it would make no sense to wait any longer trying to sync from it. In the chaining case, though, it is possible that the now stale primary could be a valid sync source in the newer term, since you are allowed to sync from secondaries. The amount of time until you would be able to sync from this node as a secondary, though, is going to be at least as long as it takes for the node to step down (kill connections) and select a new sync source. In this case, it seems like you may be wasting time waiting for this node to re-establish a sync source when you could just as well be carrying on your own selection of a sync source. One way this behavior might end up being detrimental is if this old primary is actually the closest node to you in terms of latency. If you didn't wait longer for the node to become a potentially valid sync source, then you might have overlooked it as a candidate when you pre-emptively select a new source. I'm not sure that's a big issue though. To me, the chaining case perhaps seems like less of a definitive win, and it seems harder to reason about the possible unintended impacts, so maybe just lowering the network timeouts enough will be good enough in this case.

One other behavior about the chaining case that seems reasonable to maintain is when nodes far downstream don't absolutely need to change sync sources. It might be fine for them to keep syncing from a certain secondary, even if that secondary has to go through a process of finding a new primary to sync from due to a failover. The solution of always changing sync sources if the original term of your sync source is different from your current term doesn't seem to take this scenario into account. It may force downstream secondaries to re-select sync sources even if their current sync source has changed terms, but is a perfectly fine sync source to continue syncing from. This is just one other scenario that I thought of, but there may be others that are more relevant/persuasive.

Obviously, as a short-term and sub-optimal fix, lowering the network timeouts is the easiest solution. I attached a repro (secondary_failover_time.js) that seems to demonstrate this issue from HELP-6141 pretty clearly. While the election of a new primary took only around 10 seconds in total, it took nearly a minute for a majority write to the new primary to commit. Lowering this timeout definitely appeared to help.

Comment by Spencer Brody (Inactive) [ 29/May/18 ]

Another possibility is to remember the term when the oplog fetcher was started, and if you encounter an error, only retry if the current term is the same.
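
A sketch of that suggestion, with hypothetical names: record the term when the fetcher starts and only retry the same source if the term is unchanged.

```cpp
// Illustrative only: remember the term at OplogFetcher startup, and on an
// error retry against the same sync source only if the term has not changed.
#include <cstdint>
#include <iostream>

struct FetcherTermGuard {
    std::int64_t termAtStart;

    explicit FetcherTermGuard(std::int64_t currentTerm) : termAtStart(currentTerm) {}

    // Called when a fetch attempt fails; 'currentTerm' is the node's latest
    // known term (e.g. learned via heartbeats).
    bool shouldRetrySameSource(std::int64_t currentTerm) const {
        // A newer term implies a possible new primary, so fall through to
        // sync source selection instead of retrying the old source.
        return currentTerm == termAtStart;
    }
};

int main() {
    FetcherTermGuard guard(3);                           // fetcher started in term 3
    std::cout << std::boolalpha
              << guard.shouldRetrySameSource(3) << " "   // true: same term, retry
              << guard.shouldRetrySameSource(4) << "\n"; // false: new term, re-select sync source
}
```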

Comment by Judah Schvimer [ 25/May/18 ]

Sync source selection/resolution is certainly more costly than retrying the find, which I expect to solve the problem in the common case of a network blip rather than a partition. I think lowering the network timeout on retries makes sense, especially for the initial find command, which I believe currently has a timeout of over 1 minute. One retry also seems reasonable.

Comment by Spencer Brody (Inactive) [ 24/May/18 ]

Oh interesting, good catch. That means I misdiagnosed the issue I was working on with arnie.listhaus. He was testing how quickly a secondary recovers when it has become partitioned from the primary. What I thought was happening was that the node had detected an error with its sync source and was looking for a new sync source, but kept retrying to connect to the same candidate sync source and failing before looking for a new sync source.

What I think was actually happening was that the node was happily syncing from its source, then, when it became partitioned, it got an error in the oplog fetcher. The oplog fetcher retried 3 times to connect to its current source before even entering sync source selection for the first time. As soon as the node got into sync source selection, it recovered quite quickly after selecting the remaining node as its new sync source.

 

So the question now is: in steady state, when a secondary gets any error talking to its current sync source, should it retry connecting a few times, or should it immediately enter sync source selection? I feel like it should probably still retry connecting before looking for a new sync source, to increase the stability of the spanning tree, but it should probably only retry once, and it should use a lower network timeout (and corresponding awaitData timeout) on the retry than it does when connecting to a new sync source for the first time.
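
A sketch of that proposal, assuming illustrative names and values: on the single retry of the current sync source, use a lower network timeout for the find and a correspondingly lower awaitData timeout than when connecting to a new sync source for the first time.

```cpp
// Illustrative only (assumed names/values, not the actual implementation):
// pick tighter timeouts for the single retry against the current sync source.
#include <chrono>
#include <iostream>

struct FetchAttemptTimeouts {
    std::chrono::seconds findNetworkTimeout;
    std::chrono::seconds awaitDataTimeout;   // maxTimeMS applied to tailable, awaitData getMores
};

FetchAttemptTimeouts timeoutsForAttempt(bool isRetryOfCurrentSource) {
    if (isRetryOfCurrentSource) {
        // Fail fast on the single retry so a real partition is detected quickly.
        return {std::chrono::seconds{2}, std::chrono::seconds{1}};
    }
    // More generous timeouts when first connecting to a (new) sync source.
    return {std::chrono::seconds{60}, std::chrono::seconds{5}};
}

int main() {
    auto retry = timeoutsForAttempt(true);
    std::cout << "retry find timeout: " << retry.findNetworkTimeout.count() << "s\n";
}
```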

Comment by Judah Schvimer [ 24/May/18 ]

The SyncSourceResolver does not use the OplogFetcher. Do you mean the Fetcher, or do you mean normal steady state replication outside of the SyncSourceResolver? I'm hesitant to remove retries from the OplogFetcher because they initially prevented a lot of spurious build failures, though maybe our improvements to our testing robustness make this unnecessary.
