[SERVER-45880] Flow Control lag detection mechanism can overstate lag if there are oplog holes
Created: 30/Jan/20  Updated: 27/Oct/23  Resolved: 16/Feb/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Maria van Keulen
Assignee: Dianna Hohensee (Inactive)
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-45881 Investigate and implement desired Flo... Closed
is related to SERVER-46114 Flow-control engages on a single-node... Closed
is related to SERVER-54576 Add invariants that no network calls ... Closed
is related to SERVER-54581 Report the WT all_durable timestamp i... Closed
Operating System: ALL
Sprint: Execution Team 2020-07-27, Execution Team 2021-02-08, Execution Team 2021-02-22
Participants:

Description

Flow Control uses the lastApplied wall clock time minus the lastCommitted wall clock time as a proxy for replication lag. This measure can overstate the lag when there are oplog holes: lastApplied can include operations that come after an oplog hole, and secondaries cannot replicate those operations while the hole is present.
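
To make the overstatement concrete, here is a minimal sketch of the proxy applied to a toy oplog that contains one hole. It is a model only, not server code: OplogEntry, lag_seconds, the committed_locally flag, and the sample timestamps are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class OplogEntry:
    ts: int                  # logical oplog timestamp (simplified to an int)
    wall: datetime           # wall clock time recorded with the entry
    committed_locally: bool  # False models an in-flight write, i.e. an oplog hole


def lag_seconds(later: OplogEntry, earlier: OplogEntry) -> float:
    """Wall clock difference between two oplog points, in seconds."""
    return (later.wall - earlier.wall).total_seconds()


base = datetime(2020, 1, 30, 12, 0, 0)
oplog = [
    OplogEntry(1, base, True),
    OplogEntry(2, base + timedelta(seconds=1), False),  # the hole
    OplogEntry(3, base + timedelta(seconds=9), True),   # sits after the hole
]

# Assumption: the majority commit point is still at entry 1.
last_committed = oplog[0]

# lastApplied is modeled as the newest locally committed entry, holes ignored.
last_applied = max((e for e in oplog if e.committed_locally), key=lambda e: e.ts)

# The proxy reports 9 seconds of lag, although no secondary could have
# fetched entry 3 yet because entry 2 is still a hole.
reported_lag = lag_seconds(last_applied, last_committed)
```

In this model the secondaries have already replicated everything they are able to see, so the 9 seconds reported by the proxy do not reflect any real backlog on their side.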

One proposed fix to address this is to use the wall clock time associated with the all_durable timestamp or the oplog visibility point instead of the lastApplied wall clock time, since these points do not include operations after oplog holes.
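
The proposed alternative can be sketched in the same toy model: measure lag from the newest entry that has no hole at or before it, rather than from lastApplied. The names below (Entry, visibility_point, lag_from_visibility_point) are hypothetical, and the sketch only approximates the idea of the all_durable timestamp or the oplog visibility point; it does not reflect how the server or the storage engine computes them.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Sequence


class Entry(NamedTuple):
    ts: int         # logical oplog timestamp (simplified to an int)
    wall: datetime  # wall clock time recorded with the entry
    is_hole: bool   # True models a write that has not committed yet


def visibility_point(oplog: Sequence[Entry]) -> Optional[Entry]:
    """Return the newest entry with no hole at or before it, if any."""
    visible = None
    for entry in sorted(oplog, key=lambda e: e.ts):
        if entry.is_hole:
            break  # nothing past the first hole is visible
        visible = entry
    return visible


def lag_from_visibility_point(oplog: Sequence[Entry],
                              last_committed_wall: datetime) -> float:
    """Lag proxy based on the visibility point instead of lastApplied."""
    point = visibility_point(oplog)
    if point is None:
        return 0.0
    return max(0.0, (point.wall - last_committed_wall).total_seconds())


t0 = datetime(2020, 1, 30, 12, 0, 0)
with_hole = [
    Entry(1, t0, False),
    Entry(2, t0 + timedelta(seconds=1), True),   # the hole
    Entry(3, t0 + timedelta(seconds=9), False),  # not yet visible
]
# Same oplog once the in-flight write has committed.
without_hole = [e._replace(is_hole=False) for e in with_hole]

lag_with_hole = lag_from_visibility_point(with_hole, t0)
lag_without_hole = lag_from_visibility_point(without_hole, t0)
```

With the hole present, the visibility point stays at entry 1 and the measured lag is zero. Once the hole closes, entry 3 becomes visible and the lag is measured against it.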

Any solution to this issue that changes the components of the lag detection mechanism should ensure that 1) a wall clock time is available for the proposed timestamp, and 2) the proposed timestamp is accessible in memory and is kept up to date.

SERVER-46114 represents another case for reconsidering whether lastApplied minus lastCommitted is the best measure for lag.

