Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.3.0-rc0
Affects Version/s: None
Component/s: None
Labels:
- repl-metrics
- repl-shortlist

Assigned Teams: Replication
Backwards Compatibility: Fully Compatible
Backport Requested: v8.2, v8.0, v7.0
Sprint: Repl 2026-02-16
Linked BF Score: 200
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

We've had some cases where diagnosing the source of lag on the primary was challenging, because there can be many causes. One cause we have seen is that getMore cursors on the primary can't keep up with the rate of writes.

I think it would be tremendously valuable to expose a metric in serverStatus for the getMore lag (the time between lastApplied and the last returned OpTime). This could be reported per replica set node, or as just a maximum across all nodes.

I POCed something that I'll attach, but it doesn't account for multiple nodes (i.e. each update of the metric overwrites the previous value, which could have come from a different node).
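As a rough illustration of the per-node versus maximum distinction in the description, here is a minimal Python sketch. It is not the attached POC and not server code: the class name `GetMoreLagTracker`, its methods, the node names, and the use of plain seconds in place of real OpTimes are all assumptions made for the example. It keeps one entry per node, so an update from one node cannot overwrite another node's value.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GetMoreLagTracker:
    """Hypothetical tracker, for illustration only (not the server's implementation)."""

    # Last optime returned to each node by an oplog getMore.
    # Simplified to seconds; a real OpTime also carries a term.
    last_returned: Dict[str, float] = field(default_factory=dict)

    def record_get_more(self, node_id: str, returned_optime_secs: float) -> None:
        # One slot per node: a later update from a different node
        # leaves this node's value untouched.
        self.last_returned[node_id] = returned_optime_secs

    def lag_by_node(self, last_applied_secs: float) -> Dict[str, float]:
        # Lag = primary's lastApplied minus the last optime returned to the node.
        return {
            node: max(0.0, last_applied_secs - returned)
            for node, returned in self.last_returned.items()
        }

    def max_lag(self, last_applied_secs: float) -> float:
        # Single summary value: the worst lag across all tracked nodes.
        return max(self.lag_by_node(last_applied_secs).values(), default=0.0)


tracker = GetMoreLagTracker()
tracker.record_get_more("secondary-a", 100.0)
tracker.record_get_more("secondary-b", 97.0)
per_node = tracker.lag_by_node(last_applied_secs=105.0)
worst = tracker.max_lag(last_applied_secs=105.0)
```

With these made-up inputs, `per_node` maps `secondary-a` to 5.0 and `secondary-b` to 8.0, and `worst` is 8.0. A single shared slot would instead report whichever node happened to update last.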

is fixed by

SERVER-119766 Fix data race in oplogFetcherHighestFetchedOptime metric

Closed

related to

SERVER-119880 Revert SERVER-116300

Closed

SERVER-119766 Fix data race in oplogFetcherHighestFetchedOptime metric

Closed

SERVER-119647 Surface serverStatus metrics about how much time oplog getMores are spending blocked on the storage engine

Needs Scheduling

Assignee: Denis Trailin
Reporter: Louis Williams
Participants: Denis Trailin, Githook User, Louis Williams
Votes: 0
Watchers: 8

Created: Jan 07 2026 03:47:12 PM UTC
Updated: Mar 30 2026 07:19:19 PM UTC
Resolved: Feb 14 2026 01:08:36 AM UTC
