Core Server / SERVER-71381

ReservedServiceExecutor: actively recover from spawn failure

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Operating System: ALL
    • Sprint: Service Arch 2022-12-12

      ServiceExecutorReserved is designed to tolerate thread-spawn failures.

      Spawn failures are a temporary condition: the OS can fail to spawn threads for a period of time and regain the ability later on.

      In the pre-SERVER-70151 ServiceExecutorReserved, schedule() calls would place tasks on a singleton queue and try to spawn a thread, but did not depend on that spawn succeeding.
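
      A minimal standalone sketch of that shape is below (C++17, standard library only). The class and member names (ReservedExecutorSketch, trySpawnWorker, _readyTasks, the quota of 4, and so on) are purely illustrative, not the real mongo::transport::ServiceExecutorReserved API; the point is only that schedule() enqueues unconditionally and treats the spawn as best-effort.

      // Illustrative sketch only; not the actual ServiceExecutorReserved code.
      #include <condition_variable>
      #include <cstddef>
      #include <deque>
      #include <functional>
      #include <mutex>
      #include <system_error>
      #include <thread>
      #include <utility>

      class ReservedExecutorSketch {
      public:
          using Task = std::function<void()>;

          void schedule(Task task) {
              {
                  std::lock_guard<std::mutex> lk(_mutex);
                  _readyTasks.push_back(std::move(task));  // queued regardless of spawn outcome
              }
              _hasWork.notify_one();  // an established reserved worker may pick it up
              trySpawnWorker();       // best effort: failure is tolerated, not surfaced to the caller
          }

      private:
          bool trySpawnWorker() noexcept {
              try {
                  std::thread([this] { workerLoop(); }).detach();
                  return true;
              } catch (const std::system_error&) {
                  return false;  // e.g. resource exhaustion; the queued task is not lost
              }
          }

          void workerLoop();  // worker lifecycle; defined in the continuation sketch below

          std::mutex _mutex;
          std::condition_variable _hasWork;
          std::deque<Task> _readyTasks;

          // State used by the worker lifecycle in the continuation sketch.
          std::size_t _numIdleWorkers = 0;
          std::size_t _reserveQuota = 4;  // hypothetical reserve size
          bool _shutdown = false;         // set during executor shutdown (elided here)
      };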

      If there is a period of time in which spawns fail, the reserved service executor can still hand incoming scheduled tasks to its established pool of reserved workers. When an idle worker starts a loop iteration by receiving a task, it spawns a new worker to replace itself if the worker count is below the reserve quota. When it completes a chain of tasks (now called a lease), it decides whether to die or merely go idle, again considering the reserve quota. In other words, workers reproduce only when embarking on a task chain.
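
      Continuing the sketch above, the worker lifecycle could look roughly like the following (again purely illustrative; the exact quota check in the real code may differ): a worker reproduces only when it embarks on a lease, and its replacement spawn is still best-effort.

      // Continuation of the sketch above; illustrative only.
      void ReservedExecutorSketch::workerLoop() {
          std::unique_lock<std::mutex> lk(_mutex);
          for (;;) {
              ++_numIdleWorkers;
              _hasWork.wait(lk, [&] { return !_readyTasks.empty() || _shutdown; });
              --_numIdleWorkers;
              if (_shutdown)
                  return;

              // Starting a lease: spawn a replacement so the reserve stays topped up,
              // but only while below quota. If this spawn fails, nothing ever retries
              // it later -- which is the gap this ticket describes.
              if (_numIdleWorkers < _reserveQuota)
                  trySpawnWorker();

              // Run the chain of tasks (the lease) outside the lock.
              while (!_readyTasks.empty()) {
                  Task task = std::move(_readyTasks.front());
                  _readyTasks.pop_front();
                  lk.unlock();
                  task();
                  lk.lock();
              }

              // Lease finished: die if the reserve is already at quota, otherwise go idle.
              if (_numIdleWorkers >= _reserveQuota)
                  return;
          }
      }

      With this shape, a failed replacement spawn simply leaves the pool smaller; nothing schedules another attempt later.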

      The problem:

      If spawns fail and the reserve is exhausted, tasks will be queued. Suppose the OS then recovers and spawns become possible again: the reserved SvcExec would only find out about it when a reserve thread finishes its task chain, goes idle, and spawns.

      Review of SERVER-70151 discovered this problem, but fixing it was out of scope there.

      Some kind of spawn retry loop, initiated when spawn failures occur, would probably mitigate the issue (a rough sketch follows).
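
      A rough, standalone sketch of that mitigation (illustrative only; nothing like this exists in the executor today, and all names here are hypothetical): on a spawn failure, start at most one background repair loop that retries with backoff until a spawn succeeds or the reserve no longer needs repair.

      // Illustrative sketch of the proposed mitigation; hypothetical names throughout.
      #include <algorithm>
      #include <atomic>
      #include <chrono>
      #include <system_error>
      #include <thread>

      class SpawnRetrySketch {
      public:
          // Called from the schedule()/worker path whenever a spawn attempt fails.
          void onSpawnFailure() {
              bool expected = false;
              if (!_retryRunning.compare_exchange_strong(expected, true))
                  return;  // a repair loop is already active
              try {
                  std::thread([this] { retryLoop(); }).detach();
              } catch (const std::system_error&) {
                  // Even the repair thread could not be spawned; clear the flag so a
                  // later failure tries again. A real fix would more likely drive the
                  // retries from an existing reserved worker or a timer for this reason.
                  _retryRunning.store(false);
              }
          }

      private:
          void retryLoop() {
              auto backoff = std::chrono::milliseconds{10};
              while (spawnIsStillNeeded()) {
                  if (trySpawnWorker())
                      break;  // the OS has recovered; normal worker reproduction resumes
                  std::this_thread::sleep_for(backoff);
                  backoff = std::min(backoff * 2, std::chrono::milliseconds{1000});
              }
              _retryRunning.store(false);
          }

          // Placeholders standing in for the executor's real state and spawn path.
          bool spawnIsStillNeeded() const {
              return true;  // e.g. "reserve below quota and tasks are queued"
          }
          bool trySpawnWorker() noexcept {
              try {
                  std::thread([] {}).detach();  // stand-in for launching a reserved worker
                  return true;
              } catch (const std::system_error&) {
                  return false;
              }
          }

          std::atomic<bool> _retryRunning{false};
      };

      Whether the repair loop is its own thread, a timer, or piggybacked on an existing worker is a design choice; the essential point is that recovery from a spawn failure is driven proactively rather than waiting for a worker to finish its lease.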

            Assignee: Billy Donahue (billy.donahue@mongodb.com)
            Reporter: Billy Donahue (billy.donahue@mongodb.com)
            Votes: 0
            Watchers: 2
