[SERVER-63825] Report metrics for read/write queues Created: 18/Feb/22 Updated: 29/Oct/23 Resolved: 13/Apr/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.0.0-rc0 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Daniel Gomez Ferro | Assignee: | Daniel Gomez Ferro |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Sprint: | Execution Team 2022-03-21, Execution Team 2022-04-04, Execution Team 2022-04-18 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
Similarly to what we do for read/write tickets, we need to report metrics for the read/write queues:
|
| Comments |
| Comment by Githook User [ 11/Apr/22 ] |
|
Author: {'name': 'Daniel Gómez Ferro', 'email': 'daniel.gomezferro@mongodb.com', 'username': 'dgomezferro'} Message: |
| Comment by Bruce Lucas (Inactive) [ 01/Mar/22 ] |
|
That all sounds good to me, thanks. |
| Comment by Daniel Gomez Ferro [ 01/Mar/22 ] |
|
bruce.lucas I updated the description with a new proposal. I like the idea of grouping metrics per lifecycle stage; I think there are only two stages: the operation is either in a queue or being processed. For now we've decided to remove the latency histogram to keep this simple. I'll open a new ticket to report the max latency over a rolling window interval. |
| Comment by Bruce Lucas (Inactive) [ 22/Feb/22 ] |
|
Both of those metrics are a bit problematic because they are relative to the time interval they're measured over, so to include them in FTDC you have to fix the time interval (1 sec? 1 min? since the last serverStatus?). By contrast, averages can be computed by downstream tools (t2) from cumulative metrics over any desired time interval. Median usually doesn't add much that the average doesn't tell you, so maybe it's not needed? The 99th percentile could be useful if you can accept the limitation I mentioned above. Computing it accurately might be expensive, but there's probably an efficient way to compute a useful approximation. Max might be just as useful (and it can also be computed over any time interval, given the max over every second). You could include a coarse histogram in FTDC if that would be useful. Alternatively, some other projects have chosen to include a histogram in serverStatus to support targeted investigations, but exclude it from FTDC to save space. |
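To illustrate the point about max being composable across intervals, here is a minimal sketch. It assumes the server reported one max latency value per second; the function name and the sample values are hypothetical and are not part of serverStatus or FTDC.

```python
from typing import Sequence


def max_latency_over(per_second_max_micros: Sequence[int], start: int, end: int) -> int:
    """Return the max latency over seconds [start, end).

    Assumes per_second_max_micros[i] holds the max latency observed during
    second i (a hypothetical per-second metric). The max over a longer window
    is the max of the per-second maxima, so no interval has to be fixed up front.
    """
    window = per_second_max_micros[start:end]
    if not window:
        raise ValueError("window must cover at least one sample")
    return max(window)


# Hypothetical per-second maxima, in microseconds.
samples = [120, 95, 400, 80, 310, 60]
whole_run = max_latency_over(samples, 0, 6)
last_three_seconds = max_latency_over(samples, 3, 6)
```

The same composition does not hold for a median or a 99th percentile: per-second percentiles cannot in general be combined into the percentile of a longer window, which is the limitation mentioned above.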
| Comment by Louis Williams [ 21/Feb/22 ] |
|
Regarding histograms, the goal is just to understand tail latencies, so perhaps we can surface just median and 99th percentile latencies? |
| Comment by Bruce Lucas (Inactive) [ 18/Feb/22 ] |
|
Thanks.
You didn't mean "queue length" for this did you? I assume "enqueued" and "processed" are two events in the lifecycle of an operation; maybe another one might be "processing"? Generally speaking, it could be useful to have 1) instantaneous number of operations in each phase of an operation lifecycle, and 2) cumulative time spent in each phase. |
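As a rough sketch of the two kinds of metric suggested here (an instantaneous count per phase and cumulative time per phase), the following toy tracker may help. The class, its method names, and the phase names are hypothetical and are not taken from the server code. Note that in this simplified version the cumulative time is only updated when an operation leaves a phase.

```python
import time
from collections import defaultdict


class PhaseTracker:
    """Toy tracker: ops currently in each phase, plus cumulative time per phase."""

    def __init__(self, clock=time.monotonic_ns):
        self._clock = clock                      # injectable for testing
        self.current = defaultdict(int)          # phase -> ops in that phase right now
        self.total_time_ns = defaultdict(int)    # phase -> cumulative nanoseconds
        self._entered = {}                       # op_id -> (phase, entry timestamp)

    def enter(self, op_id, phase):
        """Move an operation into a phase, closing out its previous phase if any."""
        self.leave(op_id)
        self._entered[op_id] = (phase, self._clock())
        self.current[phase] += 1

    def leave(self, op_id):
        """Remove an operation from its current phase and account for the time spent."""
        entry = self._entered.pop(op_id, None)
        if entry is None:
            return
        phase, started = entry
        self.current[phase] -= 1
        self.total_time_ns[phase] += self._clock() - started


# Usage with hypothetical phase names.
tracker = PhaseTracker()
tracker.enter(1, "queued")
tracker.enter(1, "processing")
tracker.leave(1)
```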
| Comment by Daniel Gomez Ferro [ 18/Feb/22 ] |
|
Thanks bruce.lucas, I've updated the set of metrics based on your feedback. I'll check in with the rest of the team regarding histograms. |
| Comment by Bruce Lucas (Inactive) [ 18/Feb/22 ] |
|
The general pattern for wait times in serverStatus is to report cumulative wait time (see various timeAcquiringMicros metrics). Downstream tooling (t2) then uses this to compute average queue lengths over any desired interval. This is a useful complement to instantaneous queue lengths, which can suffer from sampling artifacts. Averages shouldn't be reported directly in serverStatus: that is less flexible, because the period over which the average is computed is fixed. Generally we avoid histograms in FTDC because they don't have much diagnostic value over and above averages, and they take a lot of space. |
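A minimal sketch of how a downstream tool could derive averages from cumulative counters, assuming two samples taken at known times. The field names below are hypothetical stand-ins, not actual serverStatus field names, and this is not how t2 is implemented. The idea is that the increase in cumulative wait time divided by the elapsed wall-clock time gives the average number of waiting operations over that interval.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    wall_time_micros: int           # when the sample was taken
    total_time_queued_micros: int   # cumulative time ops spent queued (hypothetical name)
    total_dequeued: int             # cumulative ops that left the queue (hypothetical name)


def average_queue_length(prev: Sample, curr: Sample) -> float:
    """Average number of queued operations between two samples."""
    elapsed = curr.wall_time_micros - prev.wall_time_micros
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.total_time_queued_micros - prev.total_time_queued_micros) / elapsed


def average_wait_micros(prev: Sample, curr: Sample) -> float:
    """Average time queued per operation that left the queue in the interval."""
    ops = curr.total_dequeued - prev.total_dequeued
    if ops == 0:
        return 0.0
    return (curr.total_time_queued_micros - prev.total_time_queued_micros) / ops


# Two hypothetical samples taken one second apart.
prev = Sample(wall_time_micros=0, total_time_queued_micros=0, total_dequeued=0)
curr = Sample(wall_time_micros=1_000_000, total_time_queued_micros=3_000_000, total_dequeued=600)
queue_len = average_queue_length(prev, curr)
wait = average_wait_micros(prev, curr)
```

Because both functions take any pair of samples, the averaging interval is chosen by the consumer rather than fixed by the server, which is the flexibility argued for above.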