Surface serverStatus metrics about how much time oplog getMores are spending blocked on the storage engine

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • Repl 2026-02-16

      We've had some cases where diagnosing the source of lag on the primary was challenging, because there can be many causes. One cause we have seen is that getMore cursors on the primary can't keep up with the rate of writes.

      I think it would be tremendously valuable to expose a metric in serverStatus for the getMore lag (the time between lastApplied and the last returned OpTime). This could be reported per replica set node, or just as a maximum across all nodes.
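      A rough sketch of how the lag value could be computed, assuming we have a wall-clock time for the primary's lastApplied and for the last oplog entry returned to each node's getMore. All type and function names here (NodeProgress, getMoreLag, maxGetMoreLag) are hypothetical and do not exist in the server today:

```cpp
#include <algorithm>
#include <chrono>
#include <map>
#include <string>

using Millis = std::chrono::milliseconds;

// Hypothetical per-node state, not an existing server type.
struct NodeProgress {
    // Wall-clock time of the last oplog entry returned to this node's getMore
    // (assumption: a wall time is available alongside the OpTime).
    Millis lastReturnedWall;
};

// Lag of one node's oplog cursor behind the primary's lastApplied.
// Clamped at zero in case the two samples were taken slightly out of order.
inline Millis getMoreLag(Millis primaryLastAppliedWall, const NodeProgress& node) {
    return std::max(Millis{0}, primaryLastAppliedWall - node.lastReturnedWall);
}

// Single-number variant: the maximum lag across all nodes, zero if there are none.
inline Millis maxGetMoreLag(Millis primaryLastAppliedWall,
                            const std::map<std::string, NodeProgress>& nodes) {
    Millis worst{0};
    for (const auto& entry : nodes) {
        worst = std::max(worst, getMoreLag(primaryLastAppliedWall, entry.second));
    }
    return worst;
}
```

      The per-node form makes it easier to tell which secondary is falling behind; the maximum is cheaper to report and probably enough for alerting.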

      On the primary, surface serverStatus metrics about how much time oplog getMores are spending blocked on the storage engine. The layering may be a bit tricky, but I suspect getmore_cmd.cpp is the right place (see existing special case here

            Assignee:
            Unassigned
            Reporter:
            Denis Trailin
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: