[SERVER-64075] Report max latency in 1 minute rolling window Created: 01/Mar/22  Updated: 26/Oct/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Daniel Gomez Ferro Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: former-storex-namer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Sprint: Execution Team 2022-05-02
Participants:

 Description   

Add max latency to the set of reported metrics (SERVER-63825). It should be computed over a 1-minute rolling window.
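
For illustration, one cheap way to approximate a 1-minute rolling max is 60 one-second buckets, each holding the max latency seen during that second; the window max is then the max over the buckets. The sketch below is hypothetical C++ (not server code) and ignores races at bucket boundaries for brevity:

    #include <algorithm>
    #include <array>
    #include <atomic>
    #include <cstdint>

    // Sketch: recording stays O(1) per operation; reading scans 60 buckets.
    class RollingMax {
    public:
        void record(int64_t nowSec, int64_t latencyMicros) {
            size_t i = static_cast<size_t>(nowSec % kBuckets);
            // On entering a new second, reset the bucket before recording.
            // (Races at second boundaries are ignored for brevity.)
            if (_stamps[i].exchange(nowSec) != nowSec)
                _buckets[i].store(0);
            int64_t prev = _buckets[i].load();
            while (prev < latencyMicros &&
                   !_buckets[i].compare_exchange_weak(prev, latencyMicros)) {
            }
        }

        int64_t max(int64_t nowSec) const {
            int64_t result = 0;
            for (size_t i = 0; i < kBuckets; ++i) {
                // Skip buckets whose last update is older than the window.
                if (nowSec - _stamps[i].load() < static_cast<int64_t>(kBuckets))
                    result = std::max(result, _buckets[i].load());
            }
            return result;
        }

    private:
        static constexpr size_t kBuckets = 60;
        std::array<std::atomic<int64_t>, kBuckets> _buckets{};
        std::array<std::atomic<int64_t>, kBuckets> _stamps{};
    };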



 Comments   
Comment by Louis Williams [ 04/Mar/22 ]

Another option, which apparently we already do for the "opLatencies" serverStatus section, is to just have a histogram for the entire duration of the process. We could make it off-by-default and only for debugging purposes.

We already have a histogram implementation in the code, so it would be relatively straightforward to capture and report this information.
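
For illustration, a process-lifetime histogram along those lines could be as simple as the following sketch (hypothetical code with exponential buckets, not the existing opLatencies implementation):

    #include <array>
    #include <atomic>
    #include <cstdint>

    // Sketch: latency histogram with exponential buckets (1us, 2us, 4us, ...)
    // covering the whole life of the process. Never reset; cheap enough to be
    // always-on, or gated behind a debug flag as suggested above.
    class LatencyHistogram {
    public:
        void record(int64_t latencyMicros) {
            size_t bucket = 0;
            while (bucket + 1 < kBuckets && (int64_t(1) << bucket) < latencyMicros)
                ++bucket;
            _counts[bucket].fetch_add(1, std::memory_order_relaxed);
        }

        int64_t count(size_t bucket) const {
            return _counts[bucket].load(std::memory_order_relaxed);
        }

    private:
        static constexpr size_t kBuckets = 32;  // top bucket catches everything slower
        std::array<std::atomic<int64_t>, kBuckets> _counts{};
    };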

Comment by Bruce Lucas (Inactive) [ 02/Mar/22 ]

Not sure what the best semantics are. Since this is in serverStatus, we might also want to think about the impact of the various solutions on other users of serverStatus. Resetting the max on every request for serverStatus would not give good results. Resetting it only from the FTDC thread would require a new mechanism, and would be unlike anything else we do, so I'm not sure of the negative consequences. The cleanest semantics from the serverStatus perspective might be that a serverStatus sample at time t gives the max over floor(t)-1 to floor(t), but I haven't thought this through carefully, and as you say, if FTDC takes more than a second then it misses a second's worth of samples.

Another approach to handling max might be to emit a log message every time a sufficiently slow incident occurs, like we do with slow queries. If this has an attr.durationMillis then it can be used by tooling to graphically display it.
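
For illustration, the floor(t) semantics could look like the sketch below (hypothetical code, mutex-based for simplicity): track the max for the in-progress second and for the last completed second, and have every serverStatus reader report the completed one, so reads never reset anything:

    #include <algorithm>
    #include <cstdint>
    #include <mutex>

    // Sketch: a serverStatus sample at time t reports the max over the last
    // completed second, i.e. floor(t)-1 to floor(t). Reads are idempotent,
    // so multiple serverStatus consumers do not disturb each other.
    class PerSecondMax {
    public:
        void record(int64_t nowSec, int64_t latencyMicros) {
            std::lock_guard<std::mutex> lk(_mutex);
            roll(nowSec);
            _currentMax = std::max(_currentMax, latencyMicros);
        }

        int64_t lastCompletedMax(int64_t nowSec) {
            std::lock_guard<std::mutex> lk(_mutex);
            roll(nowSec);
            return _completedMax;
        }

    private:
        void roll(int64_t nowSec) {
            if (nowSec != _currentSec) {
                // If more than one second elapsed, the completed second saw nothing.
                _completedMax = (nowSec == _currentSec + 1) ? _currentMax : 0;
                _currentMax = 0;
                _currentSec = nowSec;
            }
        }

        std::mutex _mutex;
        int64_t _currentSec = 0;
        int64_t _currentMax = 0;
        int64_t _completedMax = 0;
    };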

Comment by Daniel Gomez Ferro [ 02/Mar/22 ]

bruce.lucas, for max latency I agree it makes sense to match the FTDC period; we were discussing this in the context of supporting percentiles in the future (for our internal experiments), but we can change it when the time comes.

louis.williams I thought it would be possible to reset the metric from the FTDC thread itself, so there shouldn't be any lost data even if for some reason FTDC takes longer than 1s to run. Do we want the FTDC thread to be read-only instead?
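
For illustration, resetting from the FTDC thread could be a single atomic exchange, as in this hypothetical sketch; a late sample then simply covers a longer period rather than losing data:

    #include <atomic>
    #include <cstdint>

    // Sketch: operations record into an atomic max; only the FTDC thread
    // calls readAndReset(), so each observation is counted exactly once even
    // if a sample arrives late.
    class ResettableMax {
    public:
        void record(int64_t latencyMicros) {
            int64_t prev = _max.load(std::memory_order_relaxed);
            while (prev < latencyMicros &&
                   !_max.compare_exchange_weak(prev, latencyMicros,
                                               std::memory_order_relaxed)) {
            }
        }

        // FTDC thread only: report the max since the previous sample and reset.
        int64_t readAndReset() {
            return _max.exchange(0, std::memory_order_relaxed);
        }

    private:
        std::atomic<int64_t> _max{0};
    };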

Comment by Louis Williams [ 02/Mar/22 ]

bruce.lucas, that's a good point. Our original thinking was that 1 second would potentially be too noisy, but since FTDC polls every second, and considering how much happens in the server every second, I think this would be the best solution. As you pointed out, a large window would be costly.

I don't think the rolling window would require any sorting, but it would require someone, either the observer or the operations themselves, to re-scan the window every time a value drops off. And if we process thousands of operations per second, this would be prohibitively expensive.

The problem with computing non-overlapping maxima is that we would lose information if FTDC runs in the middle of each period. Or, if we return the previous second's data, this might be a bit misleading. Is either of those tradeoffs worth it?
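
To make the re-scan cost concrete, the naive rolling max being discussed looks roughly like this hypothetical sketch; at thousands of operations per second over a 60-second window, the deque holds hundreds of thousands of entries, and every read re-scans them all:

    #include <algorithm>
    #include <cstdint>
    #include <deque>

    // Sketch of the naive rolling-window max: keep every (second, latency)
    // pair in the window, expire old ones, and re-scan the remainder.
    struct Sample {
        int64_t atSec;
        int64_t latencyMicros;
    };

    class NaiveRollingMax {
    public:
        explicit NaiveRollingMax(int64_t windowSec) : _windowSec(windowSec) {}

        void record(int64_t nowSec, int64_t latencyMicros) {
            _samples.push_back({nowSec, latencyMicros});
        }

        int64_t max(int64_t nowSec) {
            // Drop samples that have rolled out of the window...
            while (!_samples.empty() &&
                   _samples.front().atSec <= nowSec - _windowSec)
                _samples.pop_front();
            // ...then re-scan everything left: O(ops in window) per read.
            int64_t result = 0;
            for (const auto& s : _samples)
                result = std::max(result, s.latencyMicros);
            return result;
        }

    private:
        int64_t _windowSec;
        std::deque<Sample> _samples;
    };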

Comment by Bruce Lucas (Inactive) [ 01/Mar/22 ]

Would a 1-second window (matching FTDC interval) be better? Then max latency can be computed over any interval larger than 1 second.

Also, is it feasible to compute over a rolling window with acceptable performance, especially one as long as 1 minute? Doesn't this require maintaining a sorted list of latencies over the window, so that as each latency rolls out of the window you can re-compute the max over the latencies still in the window? Or am I missing something? Reporting max latency over every non-overlapping 1-second interval would be much more efficient and would contain almost as much information.
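
For illustration, once per-second maxima are in FTDC, the max over any longer interval is just the max of the per-second values it covers, e.g. (hypothetical sketch):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Sketch: perSecondMax holds one max per non-overlapping 1-second
    // interval, as FTDC would report; [startSec, endSec) selects the window.
    int64_t maxOverInterval(const std::vector<int64_t>& perSecondMax,
                            size_t startSec, size_t endSec) {
        assert(startSec < endSec && endSec <= perSecondMax.size());
        return *std::max_element(perSecondMax.begin() + startSec,
                                 perSecondMax.begin() + endSec);
    }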
