If you want your job to use a special resource such as a GPU, you have to request it explicitly. This can be done with the new Slurm option:
#SBATCH --gpus-per-node=n
where n is the number of GPUs you want to use per node.
Alternatively, you can request a specific GPU type using:
#SBATCH --gpus-per-node=type:n
where type is the type of GPU, given by its Slurm name in the table below. Note that the --gres option that was required on Peregrine can still be used as well.
Jobs requesting GPU resources will automatically end up in one of the GPU partitions.
Node | GPU type | GPUs per node | Memory per GPU | CPUs per node | Memory per node | Slurm name |
---|---|---|---|---|---|---|
A100 | NVIDIA A100 | 4 | 40 GB | 64 | 512 GB | a100 |
V100 | NVIDIA V100 | 1 | 32 GB | 8 | 128 GB | v100 |
For example, to request two NVIDIA A100 GPUs, you would use the following:
#SBATCH --gpus-per-node=a100:2
If you just want one GPU, you can leave out the type, in which case the job will get whichever GPU is available first:
#SBATCH --gpus-per-node=1
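Put together, a complete job script might look like the sketch below. It is a minimal example, not an official template: the job name, the time limit and the helper function gpu_summary are made up for illustration, and it assumes that Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is the usual behaviour but is not stated on this page.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example    # hypothetical job name
#SBATCH --time=00:30:00           # assumed wall-time limit
#SBATCH --gpus-per-node=a100:2    # two A100 GPUs on one node

# Report which GPUs were made visible to this job.
# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for GPU jobs.
gpu_summary() {
    echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES:-none}"
}

gpu_summary

# Show GPU details when the NVIDIA tools are present on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi could not query the GPUs"
fi
```

Replacing a100:2 with a plain number gives the variant without a GPU type described above.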
Besides the compute nodes listed above, there are two GPU nodes that can be used to test and develop your software. These machines are similar to the login and interactive nodes, and you can connect to them using the following hostnames:
gpu1.hb.hpc.rug.nl
gpu2.hb.hpc.rug.nl
These machines each have an NVIDIA L40S GPU, which can be shared by multiple users. The tool nvidia-smi will show whether the GPU is in use.
Please keep in mind that these are shared machines, so leave room for others to use the GPUs and do not perform long runs here. Long runs should be submitted as jobs to the scheduler.
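To check how busy a GPU is before starting a test, the query mode of nvidia-smi gives a compact overview. The following is only a sketch: the function name gpu_usage is hypothetical, and the selection of fields is one reasonable choice among many.

```shell
# Print a one-line-per-GPU overview of load and memory use.
# gpu_usage is a hypothetical helper name, not a tool installed on the cluster.
gpu_usage() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total \
                   --format=csv \
            || echo "nvidia-smi could not query the GPU"
    else
        echo "nvidia-smi not found on this machine"
    fi
}

gpu_usage
```

A high utilization or a nearly full memory suggests that someone else is already working on that GPU.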
You can usually request an interactive session by using a command like:
srun --gpus-per-node=1 --time=01:00:00 --pty /bin/bash
There is currently an issue with using srun --gpus-per-node, but you can work around it by using --gres instead:
srun --gres=gpu:1 --time=01:00:00 --pty /bin/bash
or:
srun --gres=gpu:v100:1 --time=01:00:00 --pty /bin/bash
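The two workaround commands follow the same pattern: --gpus-per-node=n becomes --gres=gpu:n, and --gpus-per-node=type:n becomes --gres=gpu:type:n. As an illustration only, the small helper below (gres_srun_cmd is a made-up name) builds the command line from a GPU count and an optional type and prints it, without running srun itself.

```shell
# Build the srun workaround command for a given GPU count and optional type.
# gres_srun_cmd is a hypothetical helper; it only prints the command.
gres_srun_cmd() {
    local count="${1:-1}"      # number of GPUs, defaults to 1
    local gpu_type="${2:-}"    # optional Slurm GPU name, e.g. v100
    local gres="gpu:${gpu_type:+${gpu_type}:}${count}"
    echo "srun --gres=${gres} --time=01:00:00 --pty /bin/bash"
}

gres_srun_cmd 1         # prints: srun --gres=gpu:1 --time=01:00:00 --pty /bin/bash
gres_srun_cmd 1 v100    # prints: srun --gres=gpu:v100:1 --time=01:00:00 --pty /bin/bash
```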
When the job starts running, you will be automatically logged in to the allocated node, allowing you to run your commands interactively. When you are done, just type exit to close your interactive job and release the allocated resources.
N.B.: interactive jobs currently don't (always) use the software stack built for the allocated nodes. You can work around this by running unset SW_STACK_ARCH && module restore after the job has started.