[SERVER-71381] ReservedServiceExecutor: actively recover from spawn failure Created: 15/Nov/22 Updated: 30/Nov/22 Resolved: 30/Nov/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Billy Donahue | Assignee: | Billy Donahue |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||
| Sprint: | Service Arch 2022-12-12 | ||||||||
| Participants: | |||||||||
| Description |
|
ServiceExecutorReserved is designed to cope with spawn failures. Spawn failures are a temporary error: the OS can fail to spawn for some period of time and regain the ability to spawn later on. In the pre- If there is a period of time in which spawns fail, the reserved service executor would still be able to hand incoming scheduled tasks to its established pool of reserved workers. When an idle worker starts its loop iteration by receiving a task, it spawns a new worker to replace itself if the worker count is below quota. When it completes a chain of tasks (now called a lease), it decides whether to die or merely go idle, again considering the reserve quota. So workers reproduce only when embarking on a task chain. The problem: if spawns fail and the reserve is exhausted, tasks will be queued. Suppose the OS recovers and spawns are then possible again. The reserved service executor would only find out about it when a reserve thread finishes its task chain, goes idle, and spawns. Review of Some kind of spawn retry loop, initiated when spawn failures occur, would probably mitigate the issue. |
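The retry loop suggested above could look roughly like the following. This is a minimal sketch only, not the server's code: `retrySpawn`, its parameters, and the capped exponential backoff policy are all hypothetical, and `trySpawn` stands in for whatever actually starts a worker thread.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of the MongoDB code base): keep retrying a
// failed spawn so the executor notices on its own when the OS recovers.
// `trySpawn` is an assumed callback that returns true if a worker was started.
inline bool retrySpawn(const std::function<bool()>& trySpawn,
                       int maxAttempts,
                       std::chrono::milliseconds initialDelay,
                       std::chrono::milliseconds maxDelay) {
    auto delay = initialDelay;
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (trySpawn())
            return true;  // The OS can spawn again; stop retrying.
        if (attempt == maxAttempts)
            break;  // Out of attempts; do not sleep after the last one.
        std::this_thread::sleep_for(delay);
        // Double the wait between attempts, but never exceed maxDelay.
        delay = std::min<std::chrono::milliseconds>(delay * 2, maxDelay);
    }
    return false;
}
```

In a real executor such a loop would presumably run on a dedicated or borrowed thread rather than block a reserve worker, and would stop once the reserve is back at quota.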
| Comments |
| Comment by Billy Donahue [ 30/Nov/22 ] |
|
The final implementation of When we run out of reserve threads and can't spawn, ingress worker lease requests will throw. Work does not queue up, so there is nothing to wake up when spawns are possible again; lease requests will simply start succeeding again. Furthermore, we now reuse old workers instead of unconditionally destroying them at the end of an ingress Session. So even if we can't spawn new workers, we will be able to reuse currently busy workers once their connections end and they are returned to the ready bucket. |
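The behavior described in this comment can be modeled in a few lines. This is a single-threaded toy model under stated assumptions, not the actual implementation: `ReservePoolModel`, `acquireLease`, `releaseLease`, and the counters are hypothetical names, and real workers are threads rather than counts.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical model of the lease behavior: a lease request either takes a
// ready worker, spawns a new one, or throws. Nothing is ever queued.
class ReservePoolModel {
public:
    // `trySpawn` is an assumed callback that returns true if the OS could spawn.
    ReservePoolModel(std::size_t readyWorkers, std::function<bool()> trySpawn)
        : _ready(readyWorkers), _trySpawn(std::move(trySpawn)) {}

    void acquireLease() {
        if (_ready > 0) {  // Prefer a worker from the ready bucket.
            --_ready;
            ++_busy;
            return;
        }
        if (_trySpawn()) {  // Otherwise try to spawn a fresh worker.
            ++_busy;
            return;
        }
        // No reserve and no spawn: fail the request instead of queueing it.
        throw std::runtime_error("no reserve worker available and spawn failed");
    }

    void releaseLease() {
        if (_busy == 0)
            throw std::logic_error("releaseLease called with no busy worker");
        --_busy;
        ++_ready;  // The worker is reused, not destroyed.
    }

    std::size_t ready() const { return _ready; }
    std::size_t busy() const { return _busy; }

private:
    std::size_t _ready;
    std::size_t _busy = 0;
    std::function<bool()> _trySpawn;
};
```

Because a failed request throws immediately, there is no backlog to drain when spawning recovers: the next `acquireLease` call either finds a returned worker or spawns successfully.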