- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Catalog and Routing
- Fully Compatible
- ALL
- v8.3
- CAR Team 2026-03-30, CAR Team 2026-04-13
- 200
User experience
Long-running operations can become unavailable: they continue to queue even though tickets are available.
Server configuration
- De-prioritization must be enabled
- Busy workload with a lot of deprioritizable operations
Technical details
Suppose we have 2 tickets available and no limit on the queue size. The following sequence can cause low-priority operations to queue indefinitely:
T1: Thread A acquires a ticket
T2: Thread B acquires a ticket
T3: Thread C tries to acquire a ticket and is enqueued because there are no tickets left. Threads queued = 1
T4: Thread A releases a ticket and tries to wake up C. Tickets available = 1
T5: Thread B sneaks in and releases another ticket before C, which has been signalled but is not yet running, manages to grab the lock. B sees thread C as awake, so it does nothing. Tickets available = 2
T6: Thread D sneaks in and tries to acquire a ticket, again before C manages to grab the lock. It sees C still queued, so to avoid skipping the queue it is also enqueued. Threads queued = 2
T7: Thread C finally grabs the lock and takes a ticket. Tickets available = 1
This sequence leaves Thread D queued even though a ticket is available, waiting for a release that might never come, in particular if we are doing a resize, which will not return the ticket. More importantly, any subsequent thread will see that the queue is not empty and will also be enqueued.
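The interleaving above can be replayed with a small single-threaded model in which each method call stands for one atomic step taken under the lock. This is only a sketch for reasoning about the trace: `TicketQueueModel`, `wake`, and `run_trace` are hypothetical names, and the model is not the server's actual semaphore implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TicketQueueModel:
    """Hypothetical model of the ticket queue; one method call = one atomic step."""
    available: int
    queue: deque = field(default_factory=deque)  # waiting threads, FIFO
    signalled: set = field(default_factory=set)  # woken, but not yet running
    holders: set = field(default_factory=set)    # threads holding a ticket

    def acquire(self, name):
        # Fast path only when nobody is queued, so newcomers cannot skip the queue.
        if self.available > 0 and not self.queue:
            self.available -= 1
            self.holders.add(name)
            return True
        self.queue.append(name)
        return False

    def release(self, name):
        self.holders.discard(name)
        self.available += 1
        # Wake the head waiter unless it has already been signalled.
        if self.queue and self.queue[0] not in self.signalled:
            self.signalled.add(self.queue[0])

    def wake(self, name):
        # The signalled waiter finally gets the lock and takes a ticket.
        assert name in self.signalled and self.available > 0
        self.signalled.discard(name)
        self.queue.remove(name)
        self.available -= 1
        self.holders.add(name)


def run_trace():
    m = TicketQueueModel(available=2)
    m.acquire("A")  # T1
    m.acquire("B")  # T2
    m.acquire("C")  # T3: no tickets left, C is enqueued
    m.release("A")  # T4: signals C
    m.release("B")  # T5: C is already signalled, nothing else happens
    m.acquire("D")  # T6: queue is not empty, D is enqueued
    m.wake("C")     # T7: C takes a ticket
    return m


final = run_trace()
print(final.available, list(final.queue), final.signalled)
```

At the end of the trace the model holds one free ticket, Thread D is the only queued thread, and nothing has signalled it, which matches the stuck state described above.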
Suggested solutions
1. Prevent enqueuing threads if there are more tickets than queued elements: we can be sure that all the threads in the queue will eventually wake up, so the fast path can safely grab a ticket if one is available.
2. Make threads that have acquired a ticket wake up others: in the case we just saw, at T7 Thread C would wake up Thread D after acquiring its ticket.
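Both suggestions can be checked against the same T1 to T7 interleaving with a small single-threaded model. The sketch below is illustrative only: `FixedTicketQueueModel`, `bypass_fix`, `chain_wake_fix`, and `run_trace_with_fixes` are hypothetical names, each method call stands for one atomic step under the lock, and none of this is the server's actual code.

```python
from collections import deque


class FixedTicketQueueModel:
    """Hypothetical model; each method call is one atomic step under the lock."""

    def __init__(self, available, bypass_fix=False, chain_wake_fix=False):
        self.available = available
        self.queue = deque()    # waiting threads, FIFO
        self.signalled = set()  # woken, but not yet running
        self.holders = set()    # threads holding a ticket
        self.bypass_fix = bypass_fix          # suggested solution 1
        self.chain_wake_fix = chain_wake_fix  # suggested solution 2

    def acquire(self, name):
        # Solution 1: with more tickets than waiters, every waiter is still
        # guaranteed a ticket, so the newcomer may take the fast path.
        can_skip = not self.queue or (
            self.bypass_fix and self.available > len(self.queue))
        if self.available > 0 and can_skip:
            self.available -= 1
            self.holders.add(name)
            return True
        self.queue.append(name)
        return False

    def _signal_head(self):
        if self.queue and self.queue[0] not in self.signalled:
            self.signalled.add(self.queue[0])

    def release(self, name):
        self.holders.discard(name)
        self.available += 1
        self._signal_head()

    def wake(self, name):
        self.signalled.discard(name)
        self.queue.remove(name)
        self.available -= 1
        self.holders.add(name)
        # Solution 2: the thread that just acquired a ticket wakes the next
        # waiter if tickets remain.
        if self.chain_wake_fix and self.available > 0:
            self._signal_head()


def run_trace_with_fixes(**fixes):
    m = FixedTicketQueueModel(available=2, **fixes)
    steps = [("acquire", "A"), ("acquire", "B"), ("acquire", "C"),
             ("release", "A"), ("release", "B"), ("acquire", "D"),
             ("wake", "C")]  # T1..T7
    for step, name in steps:
        getattr(m, step)(name)
    return m


for fixes in ({}, {"bypass_fix": True}, {"chain_wake_fix": True}):
    m = run_trace_with_fixes(**fixes)
    print(fixes, list(m.queue), m.signalled, m.available)
```

In this model, with no fix D ends up queued and unsignalled. With solution 1, D takes the fast path at T6 and the queue drains. With solution 2, D is still enqueued at T6 but is signalled by C at T7.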
- is caused by
  - SERVER-119153 Implement a priority queue for long running operations (Closed)
- related to
  - SERVER-122763 Validate OrderedTicketSemaphore tryAcquire/acquire/release with a TLA+ model (In Code Review)
  - SERVER-120543 Change TicketSemaphore acquire to enqueue directly instead of trying tryAcquire first (Closed)