[SERVER-39867] Temper ticket count recovery during periods of low replication lag Created: 27/Feb/19 Updated: 29/Oct/23 Resolved: 03/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.11 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Maria van Keulen | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Sprint: | Storage NYC 2019-05-06 | ||||||||
| Participants: | |||||||||
| Description |
|
The flow control mechanism reduces the number of ticket acquisitions permitted when replication lag reaches a certain threshold. Presently, when replication lag is low, the number of flow control ticket acquisitions permitted is increased too aggressively, leading to oscillations in throughput. Oscillations can be mitigated by tempering the increase in the number of ticket acquisitions when lag is low. |
| Comments |
| Comment by Githook User [ 03/May/19 ] | ||||||
|
Author: {'email': 'daniel.gottlieb@mongodb.com', 'name': 'Daniel Gottlieb', 'username': 'dgottlieb'} Message: | ||||||
| Comment by Daniel Gottlieb (Inactive) [ 30/Apr/19 ] | ||||||
|
In this comment I refer to a primary and a secondary. This simplifies the scenario of a primary and two secondaries where the secondaries process writes at equal rates.

I tried a different flow control algorithm that reduces oscillations (i.e., more predictable majority latencies, with trade-offs) in scenarios where a secondary can process exactly 1/4 of the operations the primary can (a severe degradation). The original algorithm will, when lagged, have the primary accept writes at half the rate at which the secondary processes them.

The problem behind the steady-state oscillations in the original algorithm is that when a primary notices it is lagged, it puts on the brakes very hard, accepting only half of what the secondary can process. When the majority lag drops under a threshold (5 seconds), the primary quickly increases its throughput. Even though the secondary empties the queue of writes to replicate, the 5-second "runway" a primary has is enough for it to aggressively outpace the secondary again before deciding it should slow down.

The solution (while keeping a "threshold" that determines whether to throttle based on the secondary or to increase the ticket allocation) is to let the primary accept writes roughly on pace with the secondaries when lag is near the threshold. As the lag gets worse, the primary scales down the number of writes it will accept. In essence, this is accomplished with a function that looks like:
[formula not preserved]
when lag > threshold, for some constant 0 < k < 1. The following simulations use k = 0.5 and threshold = 5 seconds. Thus at 10 seconds of lag, the primary reduces writes to half of what a secondary is processing (roughly; I've added another 95% factor so that at lag = 5 seconds the primary accepts slightly fewer writes than the secondary can process).

Using bruce.lucas's oscillation script that simulates flow control, I ran four variants: two of the old algorithm, varying the rate-of-increase variable (the multiplier applied when the primary determines the replica set is "healthy": 1.1 and 1.05), and two of the new algorithm, varying the same variable for parity. I took two screenshots of FTDC. The first shows the whole range; specifically, it covers the max lag when starting out. The second zooms in on the steady-state oscillations to compute the average/max latencies as well as the throughputs there.
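The throttling rule described above (a constant 0 < k < 1, a 5-second threshold, and a 95% factor) can be sketched as follows. The exponential form `k ** ((lag - threshold) / threshold)` is an assumption: it is one function that matches the two data points given in the comment (about 95% of the secondary's rate at 5 seconds of lag, about half at 10 seconds with k = 0.5). The function and parameter names are hypothetical and are not taken from the server code.

```python
def throttled_rate(secondary_rate, lag, threshold=5.0, k=0.5, fudge=0.95):
    """Writes/sec the primary accepts while lagged (illustrative sketch only).

    ASSUMPTION: the decay is exponential in how far lag exceeds the threshold.
    This matches the comment's data points but is not confirmed as the
    actual formula used in the ticket.
    """
    if not 0.0 < k < 1.0:
        raise ValueError("k must satisfy 0 < k < 1")
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    if lag < threshold:
        raise ValueError("rule only applies once lag has reached the threshold")
    # Exponent is 0 at the threshold and grows by 1 per extra threshold of lag.
    exponent = (lag - threshold) / threshold
    return secondary_rate * fudge * k ** exponent


# Usage: how the accepted write rate falls as lag grows past the threshold.
for lag_seconds in (5.0, 7.5, 10.0, 15.0):
    print(lag_seconds, throttled_rate(1000.0, lag_seconds))
```

Under this assumed form, every additional 5 seconds of lag multiplies the accepted rate by k, so the primary eases off gradually near the threshold instead of dropping straight to half the secondary's rate.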
What we see in the first screenshot is that max lag is the same for both algorithms. This makes sense because in both workloads the primary produces > 2X the secondary rate (i.e., by the time the primary notices lag 5 seconds in, it has already produced > 5 additional seconds of work for the secondary to process). The new algorithm recovers from this a bit quicker because it throttles writes a lot more (if the lag never exceeded 10 seconds, I would expect the old algorithm to "recover" sooner). We can also observe that, overall, the new-algorithm runs have better average throughput but worse average majority latency than their old-algorithm counterparts. Looking at the second image, which zooms in on steady state, we can see a few things:
Some of those observations are, once again, due to the severe degradation of the secondary, which gives primaries a big head start to build throughput momentum and demonstrate their processing superiority. Running the new/old algorithms against laggy-7, however, did not demonstrate any conclusive differences, theoretical or otherwise. The simulation used the following code (the old algorithm is commented out on top, the new algorithm follows with the ternary expanded):
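As a rough illustration of the two throttling policies compared in this comment, here is a minimal toy model. It is not the oscillation script referenced above, and everything in it is an assumption made for illustration: the function name, the one-second time step, approximating lag as backlog divided by the secondary's rate, and the exponential decay used for the "new" policy. The 4:1 primary-to-secondary capacity ratio, the 5-second threshold, k = 0.5, the 95% factor and the 1.05 multiplier come from the comment.

```python
def simulate(algorithm, seconds=120, primary_max=4000.0, secondary_rate=1000.0,
             threshold=5.0, k=0.5, fudge=0.95, multiplier=1.05):
    """Toy flow-control model; returns one (lag_seconds, primary_rate) pair per step.

    ASSUMPTIONS (hypothetical, for illustration only):
      * time advances in 1-second steps;
      * lag is approximated as the time the secondary needs to drain its backlog;
      * the "new" policy decays exponentially with lag beyond the threshold.
    """
    if algorithm not in ("old", "new"):
        raise ValueError("algorithm must be 'old' or 'new'")
    backlog = 0.0        # ops accepted by the primary but not yet applied
    rate = primary_max   # ops/sec the primary currently accepts
    samples = []
    for _ in range(seconds):
        # The secondary applies at most secondary_rate ops in this step.
        backlog = max(0.0, backlog + rate - secondary_rate)
        lag = backlog / secondary_rate
        if lag > threshold:
            if algorithm == "old":
                # Old policy: accept half of what the secondary processes.
                rate = 0.5 * secondary_rate
            else:
                # New policy: near the secondary's pace, decaying as lag grows.
                rate = secondary_rate * fudge * k ** ((lag - threshold) / threshold)
        else:
            # "Healthy": raise the accepted rate by the multiplier, up to capacity.
            rate = min(primary_max, rate * multiplier)
        samples.append((lag, rate))
    return samples


# Usage: compare peak lag and mean accepted rate for both policies.
for name in ("old", "new"):
    run = simulate(name)
    peak_lag = max(lag for lag, _ in run)
    mean_rate = sum(rate for _, rate in run) / len(run)
    print(name, peak_lag, mean_rate)
```

The model ignores ticket accounting, the increment constant, and sampling delays, so it can only suggest the shape of the behaviour, not reproduce the FTDC measurements discussed in this comment.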
| ||||||
| Comment by Maria van Keulen [ 16/Apr/19 ] | ||||||
|
This ticket should also make the ticket reduction constant and the ticket multiplier and increment constants user-configurable. |