[SERVER-74545] PriorityTicketHolder doesn't track operations that requeue after 500millis Created: 02/Mar/23  Updated: 18/Apr/23  Resolved: 14/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Haley Connelly Assignee: Backlog - Storage Execution Team
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-75677 Update ticketholder queueing time whi... Backlog
Assigned Teams:
Storage Execution
Participants:

 Description   

It could be interesting to track the number of operations that time out at 500 milliseconds in a queue, wake up, and requeue for a ticket.

Motivation: It could provide insight into what conditions cause operations to get stuck in the queue, and into the side effects on latency and throughput when operations must wake up to requeue.

Example: Suppose the 50th percentile latency is ~500 milliseconds. Do we see higher tail latencies than expected? Should we reconsider the 500 millisecond timeout?

Right now, we measure the cumulative number of operations queued in the PriorityTicketHolder at the TicketHolderWithQueueingStats level. This means it does not take into account the number of operations that must requeue.
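
To make the proposed metric concrete, here is a minimal Python sketch of a ticket holder that counts requeues separately from first-time queueing. It is a toy model, not the server's PriorityTicketHolder: the class name, the `total_queued` and `total_requeues` counters, and the `wait_slice_secs` parameter (standing in for the 500 millisecond timeout) are all hypothetical names chosen for illustration.

```python
import threading
import time


class RequeueCountingTicketHolder:
    """Toy ticket holder (hypothetical; not the real PriorityTicketHolder)."""

    def __init__(self, num_tickets, wait_slice_secs=0.5):
        self._sem = threading.Semaphore(num_tickets)
        # Assumed stand-in for the 500 millisecond wait before requeueing.
        self._wait_slice_secs = wait_slice_secs
        self._lock = threading.Lock()
        self.total_queued = 0    # operations that had to queue at least once
        self.total_requeues = 0  # timed-out waits that went back into the queue

    def acquire(self, deadline_secs=None):
        """Return True once a ticket is held, False if deadline_secs elapses."""
        # Fast path: a ticket is free, so the operation never queues.
        if self._sem.acquire(blocking=False):
            return True

        with self._lock:
            self.total_queued += 1

        start = time.monotonic()
        while True:
            # Wait for at most one slice, then wake up and re-evaluate.
            if self._sem.acquire(timeout=self._wait_slice_secs):
                return True
            if deadline_secs is not None and time.monotonic() - start >= deadline_secs:
                return False
            # The wait timed out and the operation is about to queue again.
            with self._lock:
                self.total_requeues += 1

    def release(self):
        self._sem.release()
```

In this sketch, `total_queued` corresponds to the cumulative count described above, while `total_requeues` is the additional counter this ticket proposes. Comparing the two would show how often queued operations wake up only to wait again.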



 Comments   
Comment by Haley Connelly [ 03/Mar/23 ]

I doubt it is the cause of much, but given that we know the PriorityTicketHolder is slightly slower due to extra concurrency synchronisation, I was wondering whether 500 milliseconds isn't enough to give operations a chance when queueing is high.

Comment by Louis Williams [ 03/Mar/23 ]

The 500ms timeout + requeue is also happening in the semaphore ticketholder, so I would be interested to know whether this is the cause and, if so, why it is more expensive.

Generated at Thu Feb 08 06:27:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.