[SERVER-30623] Operation latency and queuing stats are incorrect for adaptive executor Created: 11/Aug/17  Updated: 23/Mar/23  Resolved: 02/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.5.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Service Architecture
Resolution: Won't Do Votes: 24
Labels: SWDI, re-triaged-ticket
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-31063 Monitor queued network data for adapt... Closed
Assigned Teams:
Service Arch
Operating System: ALL
Participants:

 Description   

Under heavy load, operations can queue up significantly while waiting for a worker thread with the adaptive service executor, but the operation latencies reported in serverStatus don't include the time spent waiting in this queue, and the reported queue sizes don't include its length.
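A minimal pymongo sketch (the localhost connection string and driver choice are assumptions, not part of this ticket) of reading the affected metrics; under the behavior described above, neither number reflects time spent waiting for an adaptive-executor worker:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# Per-class operation latency totals (microseconds) and op counts.
for op_class in ("reads", "writes", "commands"):
    stats = status["opLatencies"][op_class]
    print(op_class, "latency(us):", stats["latency"], "ops:", stats["ops"])

# Operations currently queued on the global lock, by intent.
queue = status["globalLock"]["currentQueue"]
print("queued readers:", queue["readers"], "writers:", queue["writers"])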



 Comments   
Comment by Lauren Lewis (Inactive) [ 02/Nov/21 ]

The Service Arch team is in the process of cleaning up tickets in the backlog. This ticket has not been updated in two years, so we are closing it. Please reopen if you think this change is valuable.

Comment by Bruce Lucas (Inactive) [ 20/Dec/17 ]

As you indicate, the issue occurs because the adaptive service executor reads requests from the kernel "reluctantly", so they remain queued in the kernel (as indicated by rx_queue), where we measure neither the time spent in the kernel nor the number of requests queued there.

I think we could accurately measure queues and latency if it were possible to do one of the following:

  • "Eagerly" read requests from the kernel as soon as possible and queue them as tasks in the adaptive executor for later execution, or
  • Note the presence of available data on a given connection (but don't read the request) as soon as possible, and at that point begin the latency timer for that request and bump a queue metric (a rough sketch of this option follows below).
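An illustrative sketch of that second option in Python rather than server code (the echo server, port 28017, and metric names are stand-ins, not anything from mongod): the event loop only notes that data is readable, stamps the time, and bumps a queue gauge; the worker that later reads and runs the request charges the wait back to the metrics.

import selectors
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

sel = selectors.DefaultSelector()
workers = ThreadPoolExecutor(max_workers=4)
metrics_lock = Lock()
metrics = {"queued": 0, "totalTimeQueuedMicros": 0}

def handle_request(conn, readable_at):
    # Worker thread: the request is only read here, but the latency timer was
    # started when the data became readable, so kernel/queue wait is measured.
    waited_us = int((time.monotonic() - readable_at) * 1e6)
    with metrics_lock:
        metrics["queued"] -= 1
        metrics["totalTimeQueuedMicros"] += waited_us
    data = conn.recv(4096)      # the "reluctant" read happens here
    if data:
        conn.sendall(data)      # echo back, standing in for executing the op
    conn.close()

def on_readable(conn):
    # Event loop: don't read the request yet; just record the arrival time,
    # bump the queue gauge, and hand the connection to a worker.
    with metrics_lock:
        metrics["queued"] += 1
    sel.unregister(conn)
    workers.submit(handle_request, conn, time.monotonic())

listener = socket.create_server(("localhost", 28017))
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

while True:
    for key, _ in sel.select():
        if key.fileobj is listener:
            conn, _ = listener.accept()
            conn.setblocking(True)
            sel.register(conn, selectors.EVENT_READ)
        else:
            on_readable(key.fileobj)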
Comment by Henrik Edin [ 12/Dec/17 ]

I discussed this a little with jonathan.reams and he might want to jump into this discussion. I think rolling TCP queue or executor queue times into globalLock and/or opLatencies is not a good idea, as we probably want to be able to differentiate what's what. I think extending FTDC with /proc/net/tcp data makes a lot of sense, but we don't do any other proc parsing in serverStatus(), so I don't think we should do it here either.
From the graph in SERVER-31063 it looks like the latency is coming from tx_queue & rx_queue and not from totalTimeQueuedMicros in the adaptive executor. So without adding proc parsing to serverStatus(), there is not much else we can add?
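For reference, a hedged Python sketch of the kind of /proc/net/tcp scraping being discussed: it sums the kernel's per-socket tx_queue/rx_queue bytes for sockets bound to a given local port (27017 here purely as an example; Linux-only, and IPv6 sockets in /proc/net/tcp6 would need the same treatment).

def socket_queue_bytes(port=27017, path="/proc/net/tcp"):
    tx_total = rx_total = 0
    with open(path) as f:
        next(f)                                    # skip the header row
        for line in f:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)
            if local_port != port:
                continue
            tx_hex, rx_hex = fields[4].split(":")  # tx_queue:rx_queue, in hex
            tx_total += int(tx_hex, 16)
            rx_total += int(rx_hex, 16)
    return tx_total, rx_total

if __name__ == "__main__":
    tx, rx = socket_queue_bytes()
    print(f"tx_queue bytes: {tx}, rx_queue bytes: {rx}")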

Comment by Bruce Lucas (Inactive) [ 11/Dec/17 ]

To be more specific, the following serverStatus metrics are important for gauging system load and performance:

globalLock.currentQueue.{readers,writers}
opLatencies.latency.{reads,writes,commands}

Can or should the queues and latency internal to the adaptive executor be rolled up into these? (In the case of globalLock.currentQueue it might be more appropriate to introduce a new metric that includes both).
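A back-of-the-envelope, client-side sketch of what such a rollup could look like. The opLatencies and globalLock paths are the ones quoted above; the network.serviceExecutorTaskStats path and its totalTimeQueuedMicros field are assumptions drawn from the adaptive-executor discussion on this ticket, not a confirmed serverStatus layout.

from pymongo import MongoClient

status = MongoClient("mongodb://localhost:27017").admin.command("serverStatus")

# Aggregate operation latency and op count across the three op classes.
op_latency_us = sum(status["opLatencies"][c]["latency"] for c in ("reads", "writes", "commands"))
op_count = sum(status["opLatencies"][c]["ops"] for c in ("reads", "writes", "commands"))

# Assumed location of the adaptive executor's queue-time counter.
executor = status.get("network", {}).get("serviceExecutorTaskStats", {})
queued_us = executor.get("totalTimeQueuedMicros", 0)

if op_count:
    print("avg latency excluding executor queue (us):", op_latency_us / op_count)
    print("avg latency including executor queue (us):", (op_latency_us + queued_us) / op_count)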

Comment by Bruce Lucas (Inactive) [ 12/Sep/17 ]

Note related issue SERVER-31063 that should be considered when deciding how to surface queueing metrics for the adaptive case.
