[SERVER-63883] Report operation latency from receiving request to sending response Created: 22/Feb/22  Updated: 18/May/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Shameek Ray
Resolution: Unresolved Votes: 0
Labels: diagnostics
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-73721 Remove metrics collection and reporti... Closed
related to SERVER-74236 Slow SessionWorkflow loop gets logged... Closed
related to SERVER-55638 Diagnostics for time spent sending re... Closed
Assigned Teams:
Service Arch
Sprint: Service Arch 2022-09-19, Service Arch 2022-10-03, Service Arch 2023-02-20
Participants:
Linked BF Score: 105

 Description   

The slow query latencies reported in the mongod log and in ftdc are based on a timer in CurOp that does not include substantial mongod code path from the time the request is received until the timer is started, and from when the timer is stopped until the response is transmitted. This can result in the mongod latencies being substantially underreported in the logs and in ftdc. Some examples:

  • If the server is bottlenecked on CPU with a large number of concurrent simple queries, we've seen client-side latencies of >10 ms (most of which was CPU time in mongod) whereas latencies reported in FTDC and logs are <<1 ms, because the untimed overhead was much larger than the time needed to actually execute the query.
  • We have seen bottlenecks in places like the allocator or SSL processing that resulted in significant query latency but which is not reported in FTDC or logs because it was outside the timed code path. Reporting the entire duration including the entire request/response lifecycle code path would help us diagnose problems outside the currently timed code path.

Would it be possible to start a timer when recv() returns a new request from the o/s, and stop it when send() returns from sending the response to the o/s? TBD how or where this would be reported - in place of current latency metrics in FTDC and logs? As new field in slow query logs alongside durationMillis? As new separate log entry with its own durationMillis?



 Comments   
Comment by Blake Oler [ 09/Feb/23 ]

Note that we're going to go back to the well with TSEs to figure out a broader set of requirements to capture the information here, so that we on eng can try different approaches to see if any have a less prominent effect on performance.

Comment by Blake Oler [ 07/Feb/23 ]

We need to revert the current implementations of the listed changes due to performance costs. Will re-triage as an independent ticket outside of the scope of the Cluster Connection Health Metrics project.

Comment by Matt Diener (Inactive) [ 20/Sep/22 ]

SERVER-63883 and SERVER-55638 are being replaced by SERVER-69830 and SERVER-69831.

The end result is the same, this change is being made to better reflect the work that's being done. I prefer keeping the original tickets' descriptions as-is for future reference.

Generated at Thu Feb 08 05:58:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.