Major - P3
Resolution: Won't Fix
This ticket is for discussion only. I think it is better to keep this kind of discussion in Jira than in Slack.
As far as I know (although I failed to find confirmation in the project wiki), a job can run on a machine that already has other jobs running. This means job scripts must be very careful not to change shared state or leave rubbish behind. In particular, no shared software can be installed using package managers - only installation to a local directory is safe (I guess the running directory is cleaned after the job finishes?).
I believe the lack of isolation between jobs is not a good thing, and what comes to mind is Docker containers. They are very fast to start and shut down, and each provides a fully functional operating system environment isolated from other containers. They have their drawbacks (such as the need to expose ports), but these do not seem unfixable.
Of course, Docker containers must be scheduled on specific nodes. This is a second task, and I believe it is possible to use the current scheduler for it. Ideally, an orchestration platform like Kubernetes would handle this well:
- create a Kubernetes `Job` and specify node affinity labels (to make sure it is scheduled on a node with the correct distro)
- the Job will contain the EG job image, and the container will be started automatically. There are different ways to pass the configuration file to the container
- any other scheduling parameters (CPU, memory, storage) can be taken into account, so Kubernetes will find a node that has enough capacity
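To make the Kubernetes idea above more concrete, here is a rough sketch of what such a `Job` manifest could look like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts). The label key `example.com/distro`, the image name, the ConfigMap name `eg-job-config` and the mount path are all made-up placeholders, and passing the configuration through a ConfigMap is just one of the possible ways, not something we have decided on.

```python
import json


def build_job_manifest(name, image, distro, cpu="2", memory="4Gi",
                       storage="10Gi", config_map="eg-job-config"):
    """Build a batch/v1 Job manifest for one EG job run.

    All names and default values here are hypothetical placeholders.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed job automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Node affinity: only schedule on nodes labelled with
                    # the requested distro (label key is an assumption).
                    "affinity": {
                        "nodeAffinity": {
                            "requiredDuringSchedulingIgnoredDuringExecution": {
                                "nodeSelectorTerms": [{
                                    "matchExpressions": [{
                                        "key": "example.com/distro",
                                        "operator": "In",
                                        "values": [distro],
                                    }]
                                }]
                            }
                        }
                    },
                    "containers": [{
                        "name": "eg-job",
                        "image": image,
                        # Resource requests let the Kubernetes scheduler
                        # pick a node with enough free capacity.
                        "resources": {
                            "requests": {
                                "cpu": cpu,
                                "memory": memory,
                                "ephemeral-storage": storage,
                            }
                        },
                        # Job configuration mounted read-only from a ConfigMap.
                        "volumeMounts": [{
                            "name": "config",
                            "mountPath": "/etc/eg-job",
                            "readOnly": True,
                        }],
                    }],
                    "volumes": [{
                        "name": "config",
                        "configMap": {"name": config_map},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest("eg-job-001", "registry.example.com/eg-job:latest", "centos7")
    print(json.dumps(manifest, indent=2))
```

The output could be saved to a file and submitted with `kubectl apply -f`, or the same dict could be sent through a Kubernetes API client.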
Again, Kubernetes can be a next step, but I think it may be possible to start with a simpler scenario:
- install Docker on each node
- execute container on the host from the scheduler
- the EG job would do whatever it needs (install software etc.) and then exit, leaving no rubbish on the host system
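As an illustration of the simpler scenario, the scheduler could wrap each job in a `docker run --rm` call, so the container and its filesystem are removed when the job exits. The sketch below assumes the scheduler can invoke a Python helper on the target node; the image name, config path, and resource limits are hypothetical and only show the shape of the command.

```python
import subprocess


def build_docker_command(job_name, image, config_path,
                         cpus="2", memory="4g"):
    """Return the `docker run` argument list for one EG job.

    All argument values are placeholders chosen for illustration.
    """
    return [
        "docker", "run",
        "--rm",                      # remove the container once it exits
        "--name", job_name,
        "--cpus", cpus,              # limit CPU usage of the job
        "--memory", memory,          # limit memory usage of the job
        # Mount the job configuration read-only inside the container.
        "-v", f"{config_path}:/etc/eg-job/config:ro",
        image,
    ]


def run_job(job_name, image, config_path):
    """Run the job container and return its exit code."""
    cmd = build_docker_command(job_name, image, config_path)
    result = subprocess.run(cmd)
    return result.returncode


if __name__ == "__main__":
    print(" ".join(build_docker_command(
        "eg-job-001", "registry.example.com/eg-job:latest", "/tmp/eg-job-001.conf")))
```

Anything the job installs lives only inside the container, so nothing is left on the host apart from files written to mounted volumes.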