[SERVER-41187] Majority committed replication lag spikes after an election Created: 16/May/19  Updated: 14/Nov/19  Resolved: 14/Nov/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Maria van Keulen Assignee: Lingzhi Deng
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2019-05-21 at 4.31.03 PM.png    
Issue Links:
Related
related to SERVER-44634 Account for election down time when c... Closed
is related to SERVER-32703 Secondary can take a couple minutes t... Closed
Operating System: ALL
Sprint: Repl 2019-10-21, Repl 2019-11-04, Repl 2019-11-18
Participants:

 Description   

One of my scripts for testing Flow Control prompts an election. I've noticed that in some runs of the test, there is a spike in reported lastCommitted lag immediately after the election. This spike occurred both when the election chose the same primary as before and when it chose a different node as primary. Flow Control is a consumer of lastCommitted lag, so it will respond to the spike and throttle writes. I don't see a reason for lastCommitted lag to spike, particularly if the same primary is elected, so I believe this behavior should be investigated.
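For context, here is a minimal sketch (plain Python with made-up sample data; the field names loosely mirror the `optimes` section of `replSetGetStatus` output, and wall-clock datetimes stand in for the server's optime Timestamps) of how a consumer like Flow Control could derive majority committed lag:

```python
from datetime import datetime, timedelta

# Illustrative sample shaped like a slice of replSetGetStatus output.
# Wall-clock datetimes are used here for simplicity; the real server
# tracks optimes as (seconds, increment) Timestamps.
now = datetime(2019, 5, 16, 12, 0, 0)
status = {
    "optimes": {
        "lastCommittedOpTime": now - timedelta(seconds=8),
        "appliedOpTime": now,
    },
}

def majority_committed_lag(status):
    """Lag = newest applied op minus newest majority-committed op."""
    optimes = status["optimes"]
    return (optimes["appliedOpTime"]
            - optimes["lastCommittedOpTime"]).total_seconds()

print(majority_committed_lag(status))  # 8.0
```

If lastCommittedOpTime stops advancing while the primary keeps accepting writes, this difference grows, which is exactly the spike a throttling consumer would react to.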



 Comments   
Comment by Lingzhi Deng [ 14/Nov/19 ]

Closing as the investigation is done. We can continue the conversation in SERVER-44634 instead.

Comment by Maria van Keulen [ 14/Nov/19 ]

lingzhi.deng and I discussed this lag overstatement issue further today, and I've filed SERVER-44634 to track the fix. I think it's fair to focus on the case where the overstatement occurs due to elections rather than the case where the replica set has been idle for an extended period, since elections are the more discernible trigger.

Comment by Maria van Keulen [ 21/May/19 ]

My hypothesis is that when the new primary starts accepting writes, there is a brief window of time during which some of the secondaries have not established their new sync source yet, so they are not replicating. Here is a screenshot of FTDC data from a script that forces an election of the same primary and generates majority committed lag. Node 0 is the primary, Nodes 1 through 3 are periodically stopped to induce replication lag, and Node 4 is allowed to replicate writes as normal.

In the interval between B and C, the primary is accepting writes, but Node 4 has not yet re-established Node 0 as its sync source, as evidenced by the gap in the FTDC data for syncSourceId. The lastCommitted lag spikes in this interval. I am sending this to Replication to investigate further.
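To illustrate the hypothesis, here is a small sketch (pure Python; the samples are hypothetical, but their shape follows the FTDC fields mentioned above, where a missing `syncSourceId` means the node has not yet chosen a sync source) that correlates the sourceless window with the observed lag:

```python
# Hypothetical per-second FTDC-style samples for one secondary (e.g. Node 4).
# syncSourceId is None in the window between the election (B) and the node
# re-choosing Node 0 as its sync source (C); lag is in seconds.
samples = [
    {"t": 0, "syncSourceId": 0,    "lastCommittedLag": 0.1},
    {"t": 1, "syncSourceId": None, "lastCommittedLag": 1.9},  # B: election
    {"t": 2, "syncSourceId": None, "lastCommittedLag": 3.0},
    {"t": 3, "syncSourceId": 0,    "lastCommittedLag": 0.2},  # C: source chosen
]

def peak_lag_while_sourceless(samples):
    """Max lastCommitted lag observed while the node had no sync source."""
    gaps = [s["lastCommittedLag"] for s in samples
            if s["syncSourceId"] is None]
    return max(gaps) if gaps else 0.0

print(peak_lag_while_sourceless(samples))  # 3.0
```

The lag grows while the node is not replicating (no sync source) even though the primary is accepting writes, and drops back once replication resumes, matching the spike seen in the screenshot.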

Comment by Maria van Keulen [ 16/May/19 ]

This may be related to SERVER-32703. I did notice that in the runs of my test case that had these lag spikes, several secondaries were missing syncSourceId data in FTDC.

Generated at Thu Feb 08 04:57:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.