[SERVER-46114] Flow-control engages on a single-node replica set Created: 12/Feb/20 Updated: 27/Oct/23 Resolved: 16/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Geert Bosch | Assignee: | Maria van Keulen |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | The log file includes messages like: [log excerpt not preserved in this export] |
| Sprint: | Execution Team 2020-04-20 |
| Participants: | |
| Description |
|
Flow-control is engaging on single-node replica sets, but it should not.
|
| Comments |
| Comment by Mark Callaghan (Inactive) [ 03/Sep/20 ] |
|
Percona shared a bug report with me and this is the cause. See SERVER-50749. The impact is significant. First the load rate drops. Second, some inserts take 200 to 300 seconds when flow control engages. |
| Comment by Alexander Gorrod [ 01/Jul/20 ] |
It looks like I jumped too soon on this one. I still think it's surprising that flow control is kicking in on a single node replica set, but it isn't hurting the pytpcc load phase in my testing, so that isn't a reason to resurrect this ticket. |
| Comment by Bruce Lucas (Inactive) [ 30/Jun/20 ] |
|
For the purposes of doing that comparison it might be useful to disable flow control on both 4.2 and 4.4 in order to remove that complication and focus on measuring the relative storage engine performance. |
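For reference, flow control can be toggled at runtime via the `enableFlowControl` server parameter (available since MongoDB 4.2, where flow control was introduced). A sketch of how a comparison run might disable it:

```javascript
// Sketch: disabling flow control for an A/B comparison run.
// Run in the mongo shell against the node under test (MongoDB 4.2+).

// Turn flow control off at runtime:
db.adminCommand({ setParameter: 1, enableFlowControl: false });

// Equivalently, at startup:
//   mongod --replSet rs0 --setParameter enableFlowControl=false

// Turn it back on after the comparison:
db.adminCommand({ setParameter: 1, enableFlowControl: true });
```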
| Comment by Eric Milkie [ 30/Jun/20 ] |
|
We could consider it. From the chart, it would appear that flow control is smoothing out the spikes in throughput over time, at the expense of average throughput. Wouldn't the chart look similar for a multi-node replica set? I am not sure whether users would prefer spikier performance or more stable performance, nor whether that preference differs for one-node replica sets.
|
| Comment by Alexander Gorrod [ 30/Jun/20 ] |
|
milkie and michael.gargiulo: I've been running the pytpcc load workload described in [reference not preserved in this export]. Would you reconsider whether flow control should be disabled in 1-node replica set configurations?
|
| Comment by Maria van Keulen [ 16/Apr/20 ] |
|
In case anybody is curious, I also did a run of this workload with Flow Control disabled entirely, and saw similar write latency spikes and durable/lastCommitted lag. Here's a side-by-side comparison, including throughput: [charts not preserved in this export] |
| Comment by Maria van Keulen [ 16/Apr/20 ] |
|
It looks like the lastCommitted lag coincides with the lastDurable lag for this workload. The lastDurable OpTime is used to determine the lastCommitted OpTime by default, so any lag between lastApplied and lastDurable can influence lastCommitted lag. daniel.gottlieb also pointed out that an overwhelmed system could starve writers long enough for Flow Control to kick in.

There are spikes in write latency throughout this workload, even when Flow Control isn't engaged, suggesting strain on the mongod. Here's an example of such spikes in a section where Flow Control was not throttling any writes: [chart not preserved in this export]

I think the lag in lastDurable is due to the strain on the system, and we can conclude that the Flow Control lag measure can be overstated when the node itself is under strain. There is an existing ticket for this: [link not preserved in this export]. |
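The mechanism described above can be illustrated with a toy model (hypothetical function, not MongoDB source code): on a single-node replica set the commit point can advance no further than the durable optime, so journal-flush delay alone shows up as flow-control "lag" even though nothing is replicating.

```javascript
// Toy model (hypothetical, not MongoDB source): on a 1-node replica set,
// lastCommitted is capped by lastDurable, so journal-flush delay appears
// as flow-control "lag" even with no secondaries to replicate to.
function flowControlLagSecs(lastApplied, lastDurable) {
  // Single node: the only "majority" member is this node, and the commit
  // point uses the durable optime, so lastCommitted tracks lastDurable.
  const lastCommitted = Math.min(lastApplied, lastDurable);
  return lastApplied - lastCommitted;
}

// Healthy node: journal flushes keep up with applies, no apparent lag.
console.log(flowControlLagSecs(100, 100)); // 0

// Strained node: fsyncs fall 8s behind the applied optime; flow control
// sees 8s of "lag" and may start throttling writers.
console.log(flowControlLagSecs(100, 92)); // 8
```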
| Comment by Maria van Keulen [ 15/Apr/20 ] |
|
I was able to reproduce this issue locally. I looked at the FTDC data, and Flow Control's lag detection does report "lag". |
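Outside of FTDC, the same signal is visible on a live node in the `flowControl` section of `serverStatus` (field names as of the 4.2 output; a diagnostic sketch):

```javascript
// Sketch: checking whether flow control considers the node lagged.
// Run in the mongo shell against a live mongod (4.2+).
const fc = db.serverStatus().flowControl;
printjson({
  enabled: fc.enabled,                 // flow control on/off
  isLagged: fc.isLagged,               // currently considered lagged?
  isLaggedCount: fc.isLaggedCount,     // times lag detection has fired
  targetRateLimit: fc.targetRateLimit  // current ticket target per second
});
```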