[SERVER-34722] Add new server status metrics about oplog application Created: 27/Apr/18  Updated: 29/Oct/23  Resolved: 11/Oct/19

Status: Closed
Project: Core Server
Component/s: Diagnostics, Replication
Affects Version/s: None
Fix Version/s: 4.3.1, 4.2.7

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: Judah Schvimer
Resolution: Fixed Votes: 1
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-47403 Track the number of sync source chang... Closed
Documented
is documented by DOCS-13094 Investigate changes in SERVER-34722: ... Closed
Issue split
split to SERVER-43318 Add server status metric for average ... Open
Related
related to SERVER-37910 Create new serverStatus metric for nu... Closed
is related to SERVER-37915 Replication doesn't update opsCounter... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2
Sprint: Repl 2019-08-12, Repl 2019-08-26, Repl 2019-09-09, Repl 2019-09-23, Repl 2019-10-07, Repl 2019-10-21
Participants:

 Description   

Ideas for improvements:
Counts of replSetUpdatePosition commands sent
Counts of heartbeats sent to each node
Counts of heartbeats received from each node
Counts of getMores sent to sync source
Lag of the updatePositionLastAppliedOpTime that primaries use to commit oplog entries.
Count of elections run
Count of how often we choose a new sync source (even if it's the same one)
Liveness/state view of every other node in the replica set



 Comments   
Comment by Githook User [ 23/Apr/20 ]

Author:

{'name': 'Judah Schvimer', 'email': 'judah.schvimer@10gen.com', 'username': 'judahschvimer'}

Message: SERVER-34722 Add new server status metrics about oplog application

(cherry picked from commit 9b3801e457c4952e36f2a13d45387d647c301e03)
Branch: v4.2
https://github.com/mongodb/mongo/commit/f776f6c33b6e4a871c760dc416fac73d3bf910fc

Comment by Githook User [ 11/Oct/19 ]

Author:

{'username': 'judahschvimer', 'email': 'judah.schvimer@10gen.com', 'name': 'Judah Schvimer'}

Message: SERVER-34722 Add new server status metrics about oplog application
Branch: master
https://github.com/mongodb/mongo/commit/9b3801e457c4952e36f2a13d45387d647c301e03

Comment by Kelsey Schubert [ 25/Sep/19 ]

I think that's sufficient. Thanks!

Comment by Judah Schvimer [ 11/Sep/19 ]

Re "Counts getMores sent to sync source":

kelsey.schubert, how is this different from "metrics.repl.network.getmores.num"?

Re "Lag of the updatePositionLastAppliedOpTime that primaries use to commit oplog entries.":

I think this refers to the "appliedOpTime" field in replSetGetStatus. Do we want this in serverStatus too, or is that sufficient?
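
For reference, a minimal sketch of reading the two existing fields discussed in this comment from already-fetched command replies. The helper name get_path and the sample documents are hypothetical, and the values are made up for illustration. The "optimes" parent of "appliedOpTime" is my understanding of the replSetGetStatus reply shape, not something stated in this ticket. With PyMongo the replies could be obtained via client.admin.command("serverStatus") and client.admin.command("replSetGetStatus").

```python
from collections.abc import Mapping
from typing import Any


def get_path(doc: Mapping, dotted: str, default: Any = None) -> Any:
    """Walk a nested mapping using a dotted path such as 'a.b.c'."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node


# Illustrative stand-ins for command replies; all values are made up.
server_status = {
    "metrics": {"repl": {"network": {"getmores": {"num": 1234}}}}
}
repl_set_status = {
    "optimes": {"appliedOpTime": {"ts": (1568200000, 1), "t": 7}}
}

# Existing counter of getMores sent to the sync source.
getmores_sent = get_path(server_status, "metrics.repl.network.getmores.num")

# Applied optime as reported by replSetGetStatus (assumed location).
applied_optime = get_path(repl_set_status, "optimes.appliedOpTime")

# A path that is absent yields the default instead of raising.
missing = get_path(server_status, "metrics.repl.network.notAField")
```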

Comment by Bruce Lucas (Inactive) [ 28/Jan/19 ]

Average parallelism for each batch, updated at end of each batch, could be useful: sum of times spent by individual worker threads applying ops divided by total time for batch.
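
A minimal sketch of the calculation described above, assuming the per-worker busy times and the batch wall-clock time are already measured. The function and parameter names are hypothetical and not taken from the server code.

```python
from typing import Sequence


def average_parallelism(worker_busy_secs: Sequence[float],
                        batch_wall_secs: float) -> float:
    """Sum of per-worker apply time divided by total batch time.

    A result near the worker count means the workers were busy for
    most of the batch; a result near 1.0 means the batch was applied
    almost serially.
    """
    if batch_wall_secs <= 0:
        return 0.0  # avoid dividing by zero for an empty or instant batch
    return sum(worker_busy_secs) / batch_wall_secs


# Hypothetical batch: four workers, 0.5 s of wall-clock time in total.
busy = [0.5, 0.5, 0.25, 0.25]
parallelism = average_parallelism(busy, 0.5)
```

Here the workers were busy for 1.5 s in total over a 0.5 s batch, so the average parallelism is 3.0 out of a possible 4.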

Comment by Judah Schvimer [ 28/Jan/19 ]

Another metric to consider is how well we are using parallelism in secondary oplog application. I'm not sure of the best way to capture this, but we could check whether each worker thread on a secondary applies a similar number of ops or works for a similar amount of time per batch. Idle worker threads mean we're not being efficient with our parallelism.
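
One possible way to express this, purely as a sketch: compare the mean per-worker op count to the busiest worker's count, and count workers that received no ops. The function name and the choice of mean/max as the balance measure are assumptions for illustration, not something decided in this ticket.

```python
from typing import Sequence, Tuple


def worker_balance(ops_per_worker: Sequence[int]) -> Tuple[float, int]:
    """Return (balance, idle_workers) for one batch.

    balance is mean ops per worker divided by the busiest worker's
    ops: 1.0 means perfectly even, values near 0 mean one worker did
    almost all the work. idle_workers counts workers with zero ops.
    """
    idle = sum(1 for n in ops_per_worker if n == 0)
    busiest = max(ops_per_worker, default=0)
    if busiest == 0:
        return 0.0, idle  # nothing was applied in this batch
    mean = sum(ops_per_worker) / len(ops_per_worker)
    return mean / busiest, idle


# Hypothetical batch: three workers got 100 ops each, one got none.
balance, idle_workers = worker_balance([100, 100, 100, 0])
```

The same function could be fed per-worker busy time (as integers in microseconds, say) instead of op counts.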

Comment by Judah Schvimer [ 02/Nov/18 ]

Two things to add:

  • Number of ops applied, incremented at batch boundaries rather than per op
  • The command op counter (opsCounterRepl command) on secondaries appears not to be updated
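
A minimal sketch of the first item, assuming a counter that tallies ops privately while a batch is in flight and only publishes the total when the batch ends. The class and method names are hypothetical and do not reflect the server implementation.

```python
import threading


class BatchBoundaryCounter:
    """Counts applied ops, but only publishes at batch boundaries."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending = 0         # ops applied in the in-flight batch
        self.ops_applied = 0      # published total, visible to readers
        self.batches_applied = 0  # number of completed batches

    def op_applied(self) -> None:
        # Called once per op; does not touch the published total.
        with self._lock:
            self._pending += 1

    def end_batch(self) -> None:
        # Fold the in-flight tally into the published total.
        with self._lock:
            self.ops_applied += self._pending
            self.batches_applied += 1
            self._pending = 0


counter = BatchBoundaryCounter()
for _ in range(3):
    counter.op_applied()
published_mid_batch = counter.ops_applied  # still the pre-batch value
counter.end_batch()
published_after_batch = counter.ops_applied
```

With this shape a reader never observes a partially applied batch in the published count.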

Comment by Kelsey Schubert [ 01/Nov/18 ]

I think I'm most interested in:

  • Counts of getMores sent to sync source
    • especially helpful for subtracting from the load on the sync source to better understand application usage
  • Lag of the updatePositionLastAppliedOpTime that primaries use to commit oplog entries.
  • Counts of replSetUpdatePosition commands sent
  • Count of how often we choose a new sync source (even if it's the same one)
    • Re-selecting the same one is hard to see with current metrics, so this could be a helpful addition for focusing on interesting periods in the logs
    • We recently started capturing the sync source id, so a change is easy to spot
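
A rough sketch of the subtraction mentioned under the getMores item. It assumes that replication getMores show up in the sync source's opcounters.getmore, that every listed secondary syncs from that node, and that all deltas cover the same interval. The function name and the sample numbers are hypothetical.

```python
from typing import Iterable


def application_getmores(sync_source_getmore_delta: int,
                         repl_getmores_deltas: Iterable[int]) -> int:
    """Estimate getMores on the sync source that came from applications.

    sync_source_getmore_delta: change in the sync source's getmore op
        counter over the interval (assumed to include replication).
    repl_getmores_deltas: change in each downstream secondary's count
        of getMores sent to its sync source over the same interval.
    """
    estimate = sync_source_getmore_delta - sum(repl_getmores_deltas)
    return max(0, estimate)  # clamp: sampling skew can make this negative


# Made-up numbers: 1000 getMores seen, 300 and 250 sent by two secondaries.
app_getmores = application_getmores(1000, [300, 250])
```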

Less sure about:

  • Count of elections run (would this just be the term number, or would we also count dry runs?)
  • Counts of heartbeats: I doubt they would add diagnostic information that isn't already covered by the heartbeat lag metrics

Already have (unless I'm misunderstanding):

  • Liveness/state view of every other node in the replica set

Comment by Bruce Lucas (Inactive) [ 29/Oct/18 ]

Those metrics sound useful.

Comment by Judah Schvimer [ 27/Apr/18 ]

CC kelsey.schubert, bruce.lucas: please add any other metrics you think would be helpful, and suggest the best format for them.

Generated at Thu Feb 08 04:37:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.