Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- repl-metrics

Assigned Teams:

Query Execution
Sprint:
Repl 2026-02-16
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We've had some cases where diagnosing the source of lag on the primary was challenging, because there can be many causes. One cause we have seen is that getMore cursors on the primary can't keep up with the rate of writes.

I think it would be tremendously valuable to expose a metric in serverStatus for the getMore lag (the time between lastApplied and the last returned OpTime). This could be reported per replica set node, or just as a maximum across all nodes.
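As a rough illustration of the proposed metric, the sketch below computes a per-node getMore lag and the maximum across nodes. It is not server code: the function names, the node names, and the input mapping are hypothetical, and it assumes lag is simply the primary's lastApplied time minus the time of the last OpTime returned to each node's oplog getMore cursor.

```python
from datetime import datetime, timedelta, timezone


def getmore_lag_by_node(last_applied, last_returned_by_node):
    """Return {node: lag}, where lag = last_applied - last returned op time.

    Negative values are floored at zero so that clock or sampling skew
    never shows up as negative lag.
    """
    return {
        node: max(last_applied - returned, timedelta(0))
        for node, returned in last_returned_by_node.items()
    }


def max_getmore_lag(last_applied, last_returned_by_node):
    """Return the largest per-node lag, or zero when no cursors are tracked."""
    lags = getmore_lag_by_node(last_applied, last_returned_by_node)
    return max(lags.values(), default=timedelta(0))


# Illustrative values only; node names are made up.
primary_last_applied = datetime(2026, 2, 13, 1, 2, 20, tzinfo=timezone.utc)
last_returned = {
    "node-b:27017": primary_last_applied - timedelta(seconds=2),
    "node-c:27017": primary_last_applied - timedelta(seconds=9),
}

per_node = getmore_lag_by_node(primary_last_applied, last_returned)
worst = max_getmore_lag(primary_last_applied, last_returned)
```

Reporting only the maximum keeps the serverStatus output small, while the per-node form makes it possible to tell which node's cursor is falling behind.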

On the primary, surface serverStatus metrics for how much time oplog getMores spend blocked on the storage engine. The layering may be a bit tricky, but I suspect getmore_cmd.cpp is the right place (see the existing special case here).
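One possible shape for the blocked-time metric is a set of cumulative counters updated around each storage engine wait. The sketch below is illustrative only and written in Python rather than server C++; the class name, counter names, and the injectable clock are all assumptions, not existing server identifiers.

```python
import time
from contextlib import contextmanager


class OplogGetMoreBlockedStats:
    """Hypothetical counters for time oplog getMores spend blocked on storage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timing can be tested
        self.num_waits = 0
        self.total_blocked_micros = 0
        self.max_blocked_micros = 0

    @contextmanager
    def blocked_on_storage(self):
        """Time the wrapped section and fold the result into the counters."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = int((self._clock() - start) * 1_000_000)
            self.num_waits += 1
            self.total_blocked_micros += elapsed
            self.max_blocked_micros = max(self.max_blocked_micros, elapsed)

    def report(self):
        """Return the counters as a document, in the style of a serverStatus section."""
        return {
            "numWaits": self.num_waits,
            "totalBlockedMicros": self.total_blocked_micros,
            "maxBlockedMicros": self.max_blocked_micros,
        }


stats = OplogGetMoreBlockedStats()
with stats.blocked_on_storage():
    pass  # stand-in for the storage engine call an oplog getMore waits on
```

Cumulative totals let a monitoring system derive rates between two samples, and the maximum highlights single long stalls that an average would hide.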

is related to

SERVER-116300 Expose serverStatus metric for oplog fetching lag

Closed

Assignee: Unassigned
Reporter: Denis Trailin
Participants: Denis Trailin, Louis Williams, Mihai Andrei
Votes: 0
Watchers: 7

Created: Feb 13 2026 01:02:20 AM UTC
Updated: Apr 20 2026 03:53:33 PM UTC
