-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
Server Programmability
-
Service Arch 2022-09-19, Service Arch 2022-10-03, Service Arch 2023-02-20
-
105
The slow query latencies reported in the mongod log and in ftdc are based on a timer in CurOp that does not include substantial mongod code path from the time the request is received until the timer is started, and from when the timer is stopped until the response is transmitted. This can result in the mongod latencies being substantially underreported in the logs and in ftdc. Some examples:
- If the server is bottlenecked on CPU with a large number of concurrent simple queries, we've seen client-side latencies of >10 ms (most of which was CPU time in mongod) whereas latencies reported in FTDC and logs are <<1 ms, because the untimed overhead was much larger than the time needed to actually execute the query.
- We have seen bottlenecks in places like the allocator or SSL processing that resulted in significant query latency but which is not reported in FTDC or logs because it was outside the timed code path. Reporting the entire duration including the entire request/response lifecycle code path would help us diagnose problems outside the currently timed code path.
Would it be possible to start a timer when recv() returns a new request from the o/s, and stop it when send() returns from sending the response to the o/s? TBD how or where this would be reported - in place of current latency metrics in FTDC and logs? As new field in slow query logs alongside durationMillis? As new separate log entry with its own durationMillis?
- related to
-
SERVER-73721 Remove metrics collection and reporting from the SessionWorkflow
- Closed
-
SERVER-74236 Slow SessionWorkflow loop gets logged for streamable SDAM operations
- Closed
-
SERVER-55638 Diagnostics for time spent sending responses to ingress queries
- Closed