[SERVER-69831] Report on metrics gathered in the SessionWorkflow loop Created: 20/Sep/22  Updated: 02/Nov/22  Resolved: 26/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.2.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Matt Diener (Inactive) Assignee: Matt Diener (Inactive)
Resolution: Done Votes: 0
Labels: diagnostics
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-69830 Capture metrics for the SessionWorkfl... Closed
Documented
is documented by DOCS-15705 Investigate changes in SERVER-69831: ... Closed
Problem/Incident
Related
is related to SERVER-71029 Update slow SessionWorkflow log criteria Closed
Backwards Compatibility: Minor Change
Sprint: Service Arch 2022-10-03, Service Arch 2022-10-17, Service Arch 2022-10-31
Participants:
Linked BF Score: 105

 Description   

Context: SERVER-55638 and SERVER-63883 were both logged before our rework of SessionWorkflow started. Both are being closed in favor of an approach that consolidates the two implementations and gives us time to clarify requirements with bruce.lucas@mongodb.com.

Previous task (SERVER-69830): At each stage of the SessionWorkflow, capture `timeSpent` and store it in an object associated 1:1 with each loop.
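
As an illustration of the "one object per loop" idea, here is a minimal Python sketch (the server itself is C++). `LoopMetrics`, `stage`, and `total` are hypothetical names chosen for this example and are not taken from the SERVER-69830 implementation; only the stage names come from this ticket.

```python
import time
from contextlib import contextmanager


class LoopMetrics:
    """Hypothetical per-iteration record: one instance per SessionWorkflow loop."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock    # injectable so the timing can be tested deterministically
        self.time_spent = {}   # stage name -> seconds spent in that stage

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and add the elapsed seconds to `name`."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.time_spent[name] = self.time_spent.get(name, 0.0) + elapsed

    def total(self):
        """Total time across all recorded stages of this loop."""
        return sum(self.time_spent.values())


# Usage: create one object per loop iteration and wrap each stage.
metrics = LoopMetrics()
with metrics.stage("receiveMessage"):
    pass  # placeholder for the real work
with metrics.stage("processMessage"):
    pass
```

Keeping the record local to one loop iteration means nothing is shared between sessions while the stages run; reporting can happen once, after the iteration finishes.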

This task: Report on these gathered metrics in a way that satisfies the requirements of the stakeholders of the original tickets, SERVER-55638 and SERVER-63883.

Requirements should be discussed with bruce.lucas@mongodb.com.



 Comments   
Comment by Githook User [ 20/Oct/22 ]

Author:

{'name': 'Matt Diener', 'email': 'matt.diener@mongodb.com', 'username': 'mattdiener'}

Message: SERVER-69831 Use slowQuery threshold for slow SessionWorkflow log
Branch: master
https://github.com/mongodb/mongo/commit/5249027fda43443584bbabb3903a9f60db30301a

Comment by Matt Diener (Inactive) [ 27/Sep/22 ]

I have reason to believe that tracking all connections as they move through these states at this granularity will likely introduce some performance degradations that we don't want to take on, whereas a count of outliers is less likely to cause those problems. At that point FTDC doesn't seem much better than logging, from my perspective.

The biggest question from my perspective pertains to configurability, and determining the threshold for "slow". Can we use the same time threshold as slow queries or is there a reasonable case to be made for adding a 2nd value that can be configured?

I'm content if we use our best judgement for the rest.

Comment by Bruce Lucas (Inactive) [ 27/Sep/22 ]

matt.diener@mongodb.com regarding the log lines, I think you're asking about the relative merits of a couple of design alternatives, but I can't quite follow what the alternatives are. Can you give a couple short examples of the alternatives to clarify?

Generally speaking, I think the requirement is to log when sending a response (or receiving a request, I think) is slow, as this causes the client to see a slow response, whereas given the current logging we don't see any indication of that slowness. Ideally it should be possible to tie this slow response back to details of the query in some way. In SERVER-55638 I mentioned a couple of options.

Regarding FTDC, I think it would be useful to have some information in FTDC. Generally the kind of information that's useful is number of connections in a particular state, and/or cumulative time spent between particular state transitions. I think there was some discussion in the design doc regarding some useful metrics of this type. I'm not sure what metrics you have in mind to replace.
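
The two kinds of metric described above (connections currently in each state, and cumulative time spent in a state before the next transition) could be tracked roughly as follows. This is a Python sketch under assumptions, not the design-doc proposal: `StateStats`, `transition`, and the state names used in the usage lines are hypothetical.

```python
from collections import defaultdict


class StateStats:
    """Hypothetical aggregate counters of the kind a periodic collector could sample."""

    def __init__(self):
        self.in_state = defaultdict(int)              # state -> connections currently in it
        self.cumulative_seconds = defaultdict(float)  # state -> total time spent before leaving it
        self._current = {}                            # conn_id -> (state, entered_at)

    def transition(self, conn_id, new_state, now):
        """Move a connection to `new_state` at timestamp `now` (seconds)."""
        previous = self._current.get(conn_id)
        if previous is not None:
            state, entered_at = previous
            self.in_state[state] -= 1
            self.cumulative_seconds[state] += now - entered_at
        self._current[conn_id] = (new_state, now)
        self.in_state[new_state] += 1


# Usage: timestamps are passed in explicitly to keep the sketch deterministic.
stats = StateStats()
stats.transition(conn_id=1, new_state="receive", now=0.0)
stats.transition(conn_id=1, new_state="process", now=2.0)
```

Note that this updates shared counters on every transition of every connection, which is the per-connection bookkeeping cost a real implementation would need to weigh.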


Comment by Matt Diener (Inactive) [ 21/Sep/22 ]

bruce.lucas@mongodb.com – some detailed questions that were not answered in the design:

  1. We're worried about changing when/how often the slow query log outputs because of tooling that is currently being built off of that log. It seems like the kind of thing which could have unforeseen downstream impact and cause more trouble than the change merits. Is that intuition reasonable?
    • If so, our plan is to have a 2nd log that's just tied to the SessionWorkflow loop (the loop that does receiveMessage -> processMessage -> sendMessageIfNeeded). We'll measure each individual step, and the entirety of the flow where the server is not idling in the session.
  2. We were weighing configurability vs. simplicity in the log output of these diagnostics. Is it reasonable to use the same configured slow query timer to decide whether we want to log? Does it make sense to add a new configurable threshold or will that just make it harder to do configuration?
    • Our thoughts are that if we use the slow query log threshold, this log will always appear (its total elapsed time includes the entirety of the slow query's elapsed time). If the slowness associated with a query happens to land outside of the work captured by the slow query log, only this log will appear.
  3. Do you see value in any FTDC reporting over all of these metrics? The design only calls out operation latency.
    • If so, should they replace existing metrics?
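
To make the threshold question in item 2 concrete, here is a minimal Python sketch of the decision being proposed: sum the per-stage times for one loop and log only when the total exceeds the slow query threshold. The function name, the returned field names, and the strict `>` comparison are assumptions for illustration; they are not the server's actual log format or criteria.

```python
def slow_loop_report(time_spent_ms, slow_threshold_ms):
    """Return a report dict if the loop was slow, otherwise None.

    time_spent_ms: stage name -> milliseconds for one SessionWorkflow loop.
    slow_threshold_ms: assumed to be the same value as the slow query threshold.
    """
    total = sum(time_spent_ms.values())
    if total > slow_threshold_ms:  # assumption: strictly greater than the threshold
        # Hypothetical field names, not the real log schema.
        return {"totalMillis": total, "stages": dict(time_spent_ms)}
    return None


# A loop whose query work is fast but whose send is slow: the slow query log
# would not fire for it, but this report would.
report = slow_loop_report(
    {"receiveMessage": 1, "processMessage": 20, "sendMessageIfNeeded": 150},
    slow_threshold_ms=100,
)
```

Because the total includes the processing stage, any loop containing a slow query also crosses the threshold here, which matches the expectation stated in item 2.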
Generated at Thu Feb 08 06:14:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.