[SERVER-71381] ReservedServiceExecutor: actively recover from spawn failure Created: 15/Nov/22  Updated: 30/Nov/22  Resolved: 30/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-70151 ServiceExecutorSynchronous thread_loc... Closed
Operating System: ALL
Sprint: Service Arch 2022-12-12
Participants:

 Description   

ServiceExecutorReserved is designed to tolerate thread-spawn failures.

Spawn failures are transient: the OS can fail to spawn threads for some period of time and regain the ability to spawn later on.

In the pre-SERVER-70151 ServiceExecutorReserved, schedule() calls would place tasks on a singleton queue and attempt to spawn a thread, but did not depend on that spawn succeeding.

During a period in which spawns fail, the reserved service executor can still hand incoming scheduled tasks to its established pool of reserve workers. When an idle worker starts its loop iteration by receiving a task, it spawns a new worker to replace itself if the worker count is below the reserve quota. When it completes a chain of tasks (now called a lease), it decides whether to die or merely go idle, again considering the reserve quota. So workers reproduce only when embarking on a task chain.
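For concreteness, here is a minimal C++ sketch of the pre-SERVER-70151 model described in the two paragraphs above: schedule() enqueues a task and makes a best-effort spawn, and workers reproduce only when embarking on a task chain. All names (ReservedExecutorModel, _reserve, and so on) are illustrative, not the actual server classes, and shutdown handling is omitted for brevity.

{code:cpp}
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <system_error>
#include <thread>

class ReservedExecutorModel {
public:
    explicit ReservedExecutorModel(std::size_t reserve) : _reserve(reserve) {}

    // Enqueue the task and make a best-effort spawn. A failed spawn is
    // tolerated because established reserve workers can still drain the
    // queue -- but note that nothing retries the spawn afterwards.
    void schedule(std::function<void()> task) {
        {
            std::lock_guard lk(_mutex);
            _queue.push_back(std::move(task));
        }
        _cv.notify_one();
        try {
            _startWorker();
        } catch (const std::system_error&) {
            // Spawn failed (e.g. resource exhaustion): tolerated.
        }
    }

private:
    void _startWorker() {
        {
            std::lock_guard lk(_mutex);
            ++_workers;
        }
        try {
            std::thread([this] { _workerLoop(); }).detach();
        } catch (...) {
            std::lock_guard lk(_mutex);
            --_workers;
            throw;
        }
    }

    void _workerLoop() {
        for (;;) {
            std::function<void()> task;
            bool belowQuota;
            {
                std::unique_lock lk(_mutex);
                _cv.wait(lk, [&] { return !_queue.empty(); });
                task = std::move(_queue.front());
                _queue.pop_front();
                belowQuota = _workers < _reserve;
            }
            // Embarking on a task chain: reproduce if below quota.
            if (belowQuota) {
                try {
                    _startWorker();
                } catch (const std::system_error&) {
                    // Tolerated here too: this is the gap this ticket is
                    // about, since no retry follows a failed spawn.
                }
            }
            task();  // run the task chain (lease)
            // Chain complete: die or merely go idle, considering quota.
            std::lock_guard lk(_mutex);
            if (_workers > _reserve) {
                --_workers;
                return;  // die
            }
            // Otherwise loop around and go idle awaiting the next task.
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _queue;
    std::size_t _workers = 0;
    const std::size_t _reserve;
};
{code}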

The problem:

If spawns fail and the reserve is exhausted, tasks will queue up. Suppose the OS recovers and spawns become possible again. The reserved service executor would only find out about the recovery when a reserve thread finishes its task chain, goes idle, later receives a queued task, and attempts a spawn.

Review of SERVER-70151 discovered this problem, but fixing it was out of scope.

Some kind of spawn retry loop, initiated when spawn failures occur, would probably mitigate the issue.
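One hedged sketch of how such a retry loop might look, assuming a dedicated retry thread is created at startup (while spawning still works) so that it is available by the time spawns start failing; all names here are hypothetical:

{code:cpp}
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

class SpawnRetryLoop {
public:
    // spawnFn attempts one spawn and returns true on success.
    explicit SpawnRetryLoop(std::function<bool()> spawnFn)
        : _spawnFn(std::move(spawnFn)), _thread([this] { _run(); }) {}

    ~SpawnRetryLoop() {
        {
            std::lock_guard lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_one();
        _thread.join();
    }

    // Called by the executor whenever a spawn attempt fails.
    void onSpawnFailure() {
        {
            std::lock_guard lk(_mutex);
            _pending = true;
        }
        _cv.notify_one();
    }

private:
    void _run() {
        std::unique_lock lk(_mutex);
        while (!_shutdown) {
            _cv.wait(lk, [&] { return _pending || _shutdown; });
            while (_pending && !_shutdown) {
                lk.unlock();
                bool ok = _spawnFn();  // reattempt the failed spawn
                lk.lock();
                if (ok) {
                    _pending = false;  // the OS has recovered
                } else {
                    // Back off before retrying; wake early on shutdown.
                    _cv.wait_for(lk, std::chrono::milliseconds(100),
                                 [&] { return _shutdown; });
                }
            }
        }
    }

    std::function<bool()> _spawnFn;
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _pending = false;
    bool _shutdown = false;
    std::thread _thread;  // last member: started after the state above
};
{code}

Pre-spawning the retry thread sidesteps the obvious chicken-and-egg problem: recovering from spawn failure must not itself require a fresh spawn.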



 Comments   
Comment by Billy Donahue [ 30/Nov/22 ]

The final implementation of SERVER-70151 fixed this problem as a side effect.

When we run out of reserve threads and can't spawn, ingress worker lease requests will throw. Work does not queue up, so there is nothing to wake up when spawns are okay again; new lease requests simply start working again. Furthermore, we now reuse old workers instead of unconditionally destroying them at the end of an ingress Session. So even if we can't spawn new workers, we can reuse existing busy workers when their connections end and they are returned to the ready bucket.
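A minimal sketch of that recovery story, assuming a lease-style API with a ready bucket of reusable workers; the names (LeasingExecutorModel, makeLease, releaseLease) are illustrative, not the actual SERVER-70151 code:

{code:cpp}
#include <mutex>
#include <stdexcept>
#include <vector>

struct Worker { /* stands in for a reserve worker thread */ };

class LeasingExecutorModel {
public:
    // Acquire a worker lease for an ingress session.
    Worker* makeLease() {
        {
            std::lock_guard lk(_mutex);
            if (!_ready.empty()) {
                Worker* w = _ready.back();  // reuse an existing worker
                _ready.pop_back();
                return w;
            }
        }
        if (Worker* w = _trySpawn())
            return w;
        // No queueing: the caller sees the failure immediately, so
        // there is no backlog to wake when spawns recover.
        throw std::runtime_error("no ready worker and spawn failed");
    }

    // When a session ends, its worker returns to the ready bucket for
    // reuse rather than being destroyed, so service continues even
    // while spawns are failing.
    void releaseLease(Worker* w) {
        std::lock_guard lk(_mutex);
        _ready.push_back(w);
    }

private:
    Worker* _trySpawn() {
        try {
            // The real executor would start a thread here; a refused
            // spawn would surface as an exception.
            return new Worker();
        } catch (...) {
            return nullptr;
        }
    }

    std::mutex _mutex;
    std::vector<Worker*> _ready;
};
{code}

Because failures propagate to the lease requester instead of queueing, the executor needs no explicit wake-up path when the OS regains the ability to spawn.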
