[SERVER-41187] Majority committed replication lag spikes after an election
Created: 16/May/19 | Updated: 14/Nov/19 | Resolved: 14/Nov/19
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Maria van Keulen | Assignee: | Lingzhi Deng |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Sprint: | Repl 2019-10-21, Repl 2019-11-04, Repl 2019-11-18 |
| Participants: | |
| Description |
|
One of my scripts for testing Flow Control prompts an election. In some runs of the test, reported lastCommitted lag spikes immediately after the election. The spike occurred both when the election chose the same primary as before and when it chose a different node as primary. Flow Control is a consumer of lastCommitted lag, so it responds to the spike by throttling writes. I see no reason for lastCommitted lag to spike, particularly when the same primary is re-elected, so I believe this behavior should be investigated. |
| Comments |
| Comment by Lingzhi Deng [ 14/Nov/19 ] |
|
Closing as the investigation is done. We can continue the conversation in |
| Comment by Maria van Keulen [ 14/Nov/19 ] |
|
lingzhi.deng and I discussed this lag overstatement issue further today, and I've filed |
| Comment by Maria van Keulen [ 21/May/19 ] |
|
My hypothesis is that when the new primary starts accepting writes, there is a brief window of time during which some of the secondaries have not established their new sync source yet, so they are not replicating. Here is a screenshot of FTDC data from a script that forces an election of the same primary and generates majority committed lag. Node 0 is the primary, Nodes 1 through 3 are periodically stopped to induce replication lag, and Node 4 is allowed to replicate writes as normal. |
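To make the hypothesis above concrete, here is a minimal sketch of how majority committed lag could grow when the primary accepts writes while no secondary is replicating. It is a simplified model, not the server's implementation: the function names, the integer timestamps, and the per-node values are all hypothetical and chosen only for illustration. The model assumes the majority commit point is the newest timestamp that a majority of the five members have applied.

```python
from typing import Dict


def majority_commit_point(last_applied: Dict[str, int]) -> int:
    """Newest timestamp applied by a majority of members (simplified model)."""
    majority = len(last_applied) // 2 + 1
    ordered = sorted(last_applied.values(), reverse=True)
    # The majority-th highest value is the newest point a majority has reached.
    return ordered[majority - 1]


def majority_committed_lag(last_applied: Dict[str, int], primary: str) -> int:
    """Gap between the primary's last applied timestamp and the commit point."""
    return last_applied[primary] - majority_commit_point(last_applied)


# Hypothetical timestamps (seconds) just before the election.
# node0 is the primary, node1..node3 are lagged, node4 is caught up.
before = {"node0": 100, "node1": 98, "node2": 95, "node3": 90, "node4": 100}

# Hypothetical state shortly after the election: the primary has accepted
# 3 more seconds of writes, but no secondary has advanced because none has
# re-established a sync source yet.
after = {"node0": 103, "node1": 98, "node2": 95, "node3": 90, "node4": 100}

lag_before = majority_committed_lag(before, "node0")
lag_after = majority_committed_lag(after, "node0")
print(lag_before, lag_after)  # prints "2 5"
```

In this toy model the lag rises by exactly the amount of time the primary writes while the secondaries are idle, which is the shape of spike the hypothesis would predict. Whether that matches the FTDC data is what the investigation would need to confirm.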
| Comment by Maria van Keulen [ 16/May/19 ] |
|
This may be related to |