Loading...

XML

Word

Printable

JSON

Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- diagnostics
- sa-backlog

Assigned Teams:

Server Programmability
Sprint:
Service Arch 2022-09-19, Service Arch 2022-10-03, Service Arch 2023-02-20
Linked BF Score:
105
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The slow query latencies reported in the mongod log and in ftdc are based on a timer in CurOp that does not include substantial mongod code path from the time the request is received until the timer is started, and from when the timer is stopped until the response is transmitted. This can result in the mongod latencies being substantially underreported in the logs and in ftdc. Some examples:

If the server is bottlenecked on CPU with a large number of concurrent simple queries, we've seen client-side latencies of >10 ms (most of which was CPU time in mongod) whereas latencies reported in FTDC and logs are <<1 ms, because the untimed overhead was much larger than the time needed to actually execute the query.
We have seen bottlenecks in places like the allocator or SSL processing that resulted in significant query latency but which is not reported in FTDC or logs because it was outside the timed code path. Reporting the entire duration including the entire request/response lifecycle code path would help us diagnose problems outside the currently timed code path.

Would it be possible to start a timer when recv() returns a new request from the o/s, and stop it when send() returns from sending the response to the o/s? TBD how or where this would be reported - in place of current latency metrics in FTDC and logs? As new field in slow query logs alongside durationMillis? As new separate log entry with its own durationMillis?

related to

SERVER-73721 Remove metrics collection and reporting from the SessionWorkflow

Closed

SERVER-74236 Slow SessionWorkflow loop gets logged for streamable SDAM operations

Closed

SERVER-55638 Diagnostics for time spent sending responses to ingress queries

Closed

Assignee:: Shameek Ray
Reporter:: Bruce Lucas (Inactive)
Participants:: Blake Oler, Bruce Lucas, Matt Diener, Shameek Ray
Votes:: 0 Vote for this issue
Watchers:: 16 Start watching this issue

Created:: Feb 22 2022 07:08:11 PM UTC
Updated:: Sep 30 2024 07:32:20 PM UTC
Confidence Status Last Update:: 06/Sep/22 4:56 PM

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates