[SERVER-63825] Report metrics for read/write queues Created: 18/Feb/22  Updated: 29/Oct/23  Resolved: 13/Apr/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0

Type: New Feature Priority: Major - P3
Reporter: Daniel Gomez Ferro Assignee: Daniel Gomez Ferro
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-64467 Remove global service context for tic... Closed
Problem/Incident
causes SERVER-65660 Coverity analysis defect 122084: Unsa... Closed
causes SERVER-65890 Coverity analysis defect 122131: Dere... Closed
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2022-03-21, Execution Team 2022-04-04, Execution Team 2022-04-18
Participants:

 Description   

Similar to what we do for read/write tickets, we need to report metrics for the read/write queues:

  • Concurrency limit
  • Ops queued:
    • Total added
    • Total removed
    • Instantaneous number
    • Total time
  • Ops being processed:
    • Total added
    • Total removed
    • Instantaneous number
    • Total time
  • Total timed out operations
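
As an illustration only, the list above could be modelled as a set of cumulative counters per lifecycle stage. This is a minimal Python sketch, not the server's implementation: the class names, field names, and methods (`StageStats`, `QueueMetrics`, `enqueue`, `start_processing`, `finish`, `time_out`) are all hypothetical, and it assumes time is charged to a stage when the operation leaves that stage.

```python
from dataclasses import dataclass, field


@dataclass
class StageStats:
    """Cumulative counters for one lifecycle stage (hypothetical names)."""
    total_added: int = 0
    total_removed: int = 0
    total_time_micros: int = 0

    @property
    def current(self) -> int:
        # The instantaneous number is derived from the two cumulative counters.
        return self.total_added - self.total_removed


@dataclass
class QueueMetrics:
    """Metrics for one queue (read or write); an assumed shape, not the real one."""
    concurrency_limit: int
    queued: StageStats = field(default_factory=StageStats)
    processing: StageStats = field(default_factory=StageStats)
    total_timed_out: int = 0

    def enqueue(self) -> None:
        self.queued.total_added += 1

    def start_processing(self, waited_micros: int) -> None:
        # The operation leaves the queue and enters the processing stage.
        self.queued.total_removed += 1
        self.queued.total_time_micros += waited_micros
        self.processing.total_added += 1

    def finish(self, processing_micros: int) -> None:
        self.processing.total_removed += 1
        self.processing.total_time_micros += processing_micros

    def time_out(self, waited_micros: int) -> None:
        # A timed-out operation leaves the queue without being processed.
        self.queued.total_removed += 1
        self.queued.total_time_micros += waited_micros
        self.total_timed_out += 1
```

Keeping only cumulative totals means the instantaneous numbers never need to be stored separately, and a consumer can difference two samples to get rates over any interval.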


 Comments   
Comment by Githook User [ 11/Apr/22 ]

Author:

{'name': 'Daniel Gómez Ferro', 'email': 'daniel.gomezferro@mongodb.com', 'username': 'dgomezferro'}

Message: SERVER-63825 Add metrics for read/write queues
Branch: master
https://github.com/mongodb/mongo/commit/ac3c13f7e6b641dea363dc12eb8cabc5eeb5daad

Comment by Bruce Lucas (Inactive) [ 01/Mar/22 ]

That all sounds good to me, thanks.

Comment by Daniel Gomez Ferro [ 01/Mar/22 ]

bruce.lucas I updated the description with a new proposal. I like the idea of grouping metrics per lifecycle stage; I think there are only two stages: the operation is either in a queue or being processed.

For now we've decided to remove the latency histogram to keep this simple. I'll open a new ticket to report the max latency over a rolling window interval.

Comment by Bruce Lucas (Inactive) [ 22/Feb/22 ]

Both of those metrics are a bit problematic because they are relative to the time interval they're measured over, so to include them in FTDC you have to fix the time interval (1 sec? 1 min? since the last serverStatus?). By contrast, averages can be computed by downstream tools (t2) from cumulative metrics over any desired time interval.

Median usually doesn't add much that average doesn't tell you, so maybe not needed?

99th %ile could be useful if you can accept the limitation I mentioned above. Computing it accurately might be expensive, but there's probably an efficient way to compute a useful approximation. Max might be just as useful (and that also can be computed over any time interval given max over every second).
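
To illustrate the point about max, here is a minimal Python sketch showing that the max over any interval is just the max of the per-second maxima it covers. The function name and the list-of-integers input are assumptions made for illustration; they do not describe an existing FTDC or t2 format.

```python
from typing import Sequence


def max_over_interval(per_second_max: Sequence[int], start: int, end: int) -> int:
    """Return the max latency over seconds [start, end).

    per_second_max[i] is assumed to hold the max latency observed during
    second i (a hypothetical input format).
    """
    window = per_second_max[start:end]
    if not window:
        raise ValueError("empty interval")
    # The max of per-second maxima equals the max over the whole interval.
    return max(window)
```

The same composition does not hold for percentiles: a 99th percentile over a minute cannot be recovered from sixty per-second 99th percentiles.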

You could include a coarse histogram in FTDC if that would be useful. Alternatively some other projects have chosen to include a histogram in serverStatus to support targeted investigations, but exclude it from FTDC to save space.

Comment by Louis Williams [ 21/Feb/22 ]

Regarding histograms, the goal is just to understand tail latencies, so perhaps we can surface just median and 99th percentile latencies?

Comment by Bruce Lucas (Inactive) [ 18/Feb/22 ]

Thanks.

Queue length: cumulative number of enqueued operations

You didn't mean "queue length" for this, did you? I assume "enqueued" and "processed" are two events in the lifecycle of an operation; maybe another one might be "processing"? Generally speaking, it could be useful to have 1) the instantaneous number of operations in each phase of an operation lifecycle, and 2) the cumulative time spent in each phase.

Comment by Daniel Gomez Ferro [ 18/Feb/22 ]

Thanks bruce.lucas, I've updated the set of metrics based on your feedback.

I'll check in with the rest of the team regarding histograms.

Comment by Bruce Lucas (Inactive) [ 18/Feb/22 ]

The general pattern for wait times in serverStatus is to report cumulative wait time (see various timeAcquiringMicros metrics). Downstream tooling (t2) then uses this to compute average queue lengths over any desired interval. This is a useful complement to instantaneous queue lengths, which can suffer from sampling artifacts.
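
A minimal sketch of that downstream computation, in Python: given two samples of the cumulative counters, the wait time accumulated between them divided by the elapsed wall-clock time gives the average queue length over that interval. The `Sample` type and its field names are hypothetical, and this is not how t2 is known to be implemented; it only illustrates the arithmetic.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """One serverStatus-style sample (hypothetical field names)."""
    time_micros: int              # wall-clock time of the sample
    total_queue_time_micros: int  # cumulative time operations spent queued
    total_removed: int            # cumulative operations removed from the queue


def average_queue_length(a: Sample, b: Sample) -> float:
    """Average number of queued operations between samples a and b."""
    elapsed = b.time_micros - a.time_micros
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (b.total_queue_time_micros - a.total_queue_time_micros) / elapsed


def average_wait_micros(a: Sample, b: Sample) -> float:
    """Average queue wait per operation removed between samples a and b."""
    removed = b.total_removed - a.total_removed
    if removed == 0:
        return 0.0
    return (b.total_queue_time_micros - a.total_queue_time_micros) / removed
```

For example, 25 seconds of accumulated wait over a 10-second interval corresponds to an average of 2.5 operations queued. Because the inputs are cumulative, the same two functions work for any pair of samples, which is what makes the interval a choice of the tool rather than of the server.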

Averages shouldn't be reported directly in serverStatus, since that is less flexible: the period over which the average is computed is fixed. Generally we avoid histograms in FTDC because they don't add much diagnostic value over and above averages, and they take a lot of space.

Generated at Thu Feb 08 05:58:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.