Slurm
What is Slurm and who is allowed to use it
Slurm is a so-called batch system. That means you submit tasks (jobs) to Slurm and it makes sure that
these tasks are executed on one or more remote machines according to predefined rules.
Slurm can be used by all students and faculty members of the Institute. No further registration is required.
There are no restrictions on when jobs can be submitted or which machines can be chosen to execute a job. Keep
in mind that Slurm will not start jobs on a machine where a local user is logged in, and that jobs running on a machine
are suspended when a local user logs in and stay suspended until that user logs out again.
How does Slurm work
Is there a Graphical User Interface (GUI)
Yes. The command sview opens the GUI for Slurm. The command smap is not a GUI but offers a good text-based interface for Slurm.
Which machines are connected
All machines in the student computer pools of the Institute with the exception of the clients in Deneb/Terminus are connected to Slurm.
The abakus servers are connected to Slurm as well. They have more powerful NVIDIA cards, large amounts of RAM, fast and large local SSDs, and many CPU cores.
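To see which machines and partitions Slurm currently knows about, the standard sinfo command can be used. The following is a minimal sketch; sinfo is not mentioned on this page, and the check for its presence is only there so the snippet also runs on machines without the Slurm client tools.

```shell
#!/bin/bash
# Record whether the Slurm client tools are installed on this machine.
if command -v sinfo >/dev/null 2>&1; then
    slurm_available="yes"
    sinfo          # summary: one line per partition and node state
    sinfo -N -l    # long listing: one line per node with CPUs, memory and state
else
    slurm_available="no"
    echo "sinfo not found; run this on a machine with the Slurm client tools"
fi
```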
What limits are there
- The number of concurrently running jobs is limited to 15 per user.
- The number of concurrently submitted jobs is limited to 30 per user.
- Access to the abakus servers can be requested by mail from the RBG. A good reason is required.
- A higher priority for one's jobs can also be requested from the RBG. A good reason is required here as well.
Good reasons include but are not limited to:
- Use for one's thesis, publication or official project. An informal confirmation of this by one's thesis supervisor etc. is required.
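To check how close you are to the per-user limits, the jobs reported by squeue can be counted. This is a rough sketch: the two limit values come from the list above, the variable names are made up for illustration, and the counts are only approximate because squeue may show a pending job array as a single line.

```shell
#!/bin/bash
max_running=15      # per-user limit on running jobs
max_submitted=30    # per-user limit on submitted jobs

running=0
submitted=0
if command -v squeue >/dev/null 2>&1; then
    user="$(id -un)"
    # -h drops the header line, -u filters by user, -t filters by job state.
    running="$(squeue -h -u "$user" -t RUNNING | wc -l)"
    submitted="$(squeue -h -u "$user" | wc -l)"
fi

free_slots=$((max_submitted - submitted))
echo "running:   ${running} of ${max_running}"
echo "submitted: ${submitted} of ${max_submitted} (${free_slots} more can be submitted)"
```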
How to submit a job
There are two ways to submit a job to Slurm. You can do it with the CLI commands srun/sbatch or via the GUI sview.
You should use a shell script to set required parameters in both cases.
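A job script of this kind might look like the following sketch. The job name, resource values and time limit are placeholders chosen for illustration, not recommendations for this installation. The #SBATCH lines are read by sbatch and are plain comments to the shell, so the script also runs outside of Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=example       # hypothetical job name shown in squeue
#SBATCH --ntasks=1               # one task
#SBATCH --cpus-per-task=1        # one CPU core for that task
#SBATCH --time=00:10:00          # wall-clock limit (HH:MM:SS)
#SBATCH --output=slurm-%j.out    # %j is replaced by the job ID

# Slurm sets SLURM_JOB_ID inside a job; fall back to "none" elsewhere.
job_id="${SLURM_JOB_ID:-none}"
node="$(uname -n)"

echo "job ${job_id} running on ${node}"
# The actual computation would follow here.
```

Assuming the script is saved as job.sh (a hypothetical file name), "sbatch job.sh" queues it and returns immediately, while "srun job.sh" runs it and waits for it to finish.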
Slurm supports heterogeneous jobs. However, these jobs can only be suspended and resumed as a whole.
We therefore strongly discourage the use of this job type, because it puts these jobs at a massive disadvantage in job execution.
If you have access to multiple QOS, you need to specify which QOS to use on each submission via the -q option. This is analogous to specifying the partition via -p.
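Selecting QOS and partition on submission can be sketched as follows. The names abaki, Abaki and rechenjob.sh are the ones used in the srun example on this page; replace them with a QOS, partition and script you actually have access to. The snippet only submits if sbatch is installed and the script exists.

```shell
#!/bin/bash
qos="abaki"              # QOS to use for this job
partition="Abaki"        # partition to run in
script="rechenjob.sh"    # job script to submit

# Print the command first so it can be checked before anything is submitted.
echo "sbatch -q ${qos} -p ${partition} ${script}"

# Submit only where sbatch is installed and the job script actually exists.
if command -v sbatch >/dev/null 2>&1 && [ -f "$script" ]; then
    sbatch -q "$qos" -p "$partition" "$script"
fi
```

The same settings can also be placed inside the job script as "#SBATCH --qos=..." and "#SBATCH --partition=..." lines, which avoids retyping them on every submission.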
How can you modify a submitted job
Properties of submitted jobs can be viewed with the commands squeue and sview.
You can remove a job with scancel. Further information: “man scancel”
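The commands for inspecting, changing and removing a job can be wrapped in small helper functions, as in the sketch below. The function names are made up for illustration. scontrol is a standard Slurm command that is not mentioned on this page, and which properties a regular user may change depends on the local configuration.

```shell
#!/bin/bash
# List all jobs of the current user.
list_my_jobs() { squeue -u "$(id -un)"; }

# Show every property of one job; $1 is the job ID printed by sbatch.
show_job() { scontrol show job "$1"; }

# Set a new time limit for a job; $1 is the job ID, $2 the new limit (HH:MM:SS).
set_time_limit() { scontrol update JobId="$1" TimeLimit="$2"; }

# Remove a job from the queue, or stop it if it is already running.
cancel_job() { scancel "$1"; }

# Only call Slurm where the client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    list_my_jobs
fi
```

With these definitions loaded, "show_job 12345" prints the details of the (hypothetical) job 12345 and "cancel_job 12345" removes it.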
What do I do in case of errors
- By default, Slurm creates one file per job that collects both stdout and stderr. This file is named slurm-jobid.out, where jobid is the ID of the job. Debugging should start here. Wrong or missing PATH values etc. can be seen here.
- sview or squeue can show the reason why a job is still waiting.
- If jobs just disappear, the RBG should be informed. Important details are when it happened and which partition was used.
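A first debugging pass along these lines could look like the following sketch. The job ID 12345 and the helper name are hypothetical, and the log file is assumed to be in the current directory, which is where it normally ends up when sbatch was started from there.

```shell
#!/bin/bash
# Build the default log file name for a given job ID.
log_file_for() { echo "slurm-$1.out"; }

job_id=12345    # hypothetical job ID, replace with your own
log="$(log_file_for "$job_id")"

# Show the end of the log, where error messages usually appear.
if [ -f "$log" ]; then
    tail -n 20 "$log"
else
    echo "no log file ${log} in $(pwd)"
fi

# Ask Slurm for job ID (%i), state (%T) and the reason for that state (%r).
if command -v squeue >/dev/null 2>&1; then
    squeue -j "$job_id" -o "%i %T %r"
fi
```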
What kind of jobs is Slurm suited for
- Tasks that can be broken down into lots of independent smaller tasks are best suited for Slurm. For example, each frame of a 3D scene can be rendered independently.
- Tasks that run for a very long time. In this case, please use checkpointing or periodically save the state of your computation so that you can resume from that point after an interruption.
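Both job types can be combined in one script, sketched below under a few assumptions. Job arrays (--array) are a standard Slurm feature not mentioned on this page, and how array tasks count toward the per-user limits is not stated here. The file name, step count and array range are placeholders. Each array task works on its own index and records its progress in a small checkpoint file, so a restarted task continues after the last finished step.

```shell
#!/bin/bash
#SBATCH --job-name=frames    # hypothetical job name
#SBATCH --array=1-10         # ten independent tasks, one per array index
#SBATCH --time=01:00:00      # wall-clock limit per task (HH:MM:SS)

# Slurm sets SLURM_ARRAY_TASK_ID per array task; default to 1 elsewhere.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
state_file="checkpoint-${task_id}.txt"    # hypothetical checkpoint file
total_steps=5

# Resume from the last saved step if a checkpoint exists.
step=0
if [ -f "$state_file" ]; then
    step="$(cat "$state_file")"
fi

while [ "$step" -lt "$total_steps" ]; do
    step=$((step + 1))
    # Placeholder for the real work of this step, e.g. part of a rendering.
    echo "task ${task_id}: finished step ${step}"
    # Save progress so that a restarted task continues after this step.
    echo "$step" > "$state_file"
done
```

A real computation would store its actual intermediate results in the checkpoint rather than just a step counter, and would delete the checkpoint once the final result has been written.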
Command-line interface (CLI) command overview
- See “man -k slurm” and “man slurm”.
- Job for the abaki partition: “srun -q abaki -p Abaki rechenjob.sh”
Links
External information