Running Python Jobs

This page covers how to submit Python jobs to the Hábrók cluster. All computationally intensive work on Hábrók must be submitted through the SLURM job scheduler rather than run directly on the login or interactive nodes. Depending on the nature of your code, there are four main ways to run Python jobs on the cluster.

Single CPU is the right choice for most Python scripts. If your code runs sequentially and does not explicitly use parallelism, a single CPU job is all you need. This is also the simplest setup and a good starting point if you are new to the cluster.

Multiple CPUs are useful when your code can be parallelised within a single node, for example using Python's multiprocessing library or other task-parallel frameworks. This is a good option for embarrassingly parallel workloads where you can split your data or tasks into independent chunks.

GPU jobs are suited for deep learning, large-scale matrix operations, and other workloads that benefit from thousands of parallel cores. Not all code benefits from a GPU — sequential code or small datasets may actually run slower due to the overhead of moving data to the GPU. If you are unsure, test on a CPU first.

Multiple nodes (MPI) is the most complex setup and should only be used when your job is too large for a single node or when you explicitly need distributed computing across nodes.

In general, start with the simplest option that fits your needs and only move to a more complex setup if you have a clear reason to do so.

For general information on how to write job scripts and allocate resources, see the Job Management page. For submitting multiple jobs of the same kind, see the page on Job arrays.

To demonstrate how to run a Python script on the Hábrók cluster, we use the following example script:

python_example.py
#!/bin/env python
import math # The math module is imported for its square root function.
 
x = 2*3*7
print ("The answer of 2*3*7 = %d" % (x))
x = math.sqrt(1764)
print ("Also the square root of 1764 = %d" % (x))

Save the file as python_example.py. Next, create a job script that reserves resources, loads the Python module, activates a virtual environment and runs the Python script. In this case the job script is called python_batch.sh.

python_batch.sh
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=python_example
#SBATCH --mem=800
 
module purge
module load Python/3.13.5-GCCcore-14.3.0
source $HOME/venvs/my_env/bin/activate
 
python python_example.py
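
The job script above activates a virtual environment. If you have not created one yet, it can be set up once from the command line; the path $HOME/venvs/my_env is just an example:

```shell
# Create the environment once, using the Python you intend to load in your jobs
python3 -m venv "$HOME/venvs/my_env"
# Activate it; packages installed with pip now go into this environment
source "$HOME/venvs/my_env/bin/activate"
python --version
```

See the Python Environments page for more on creating and managing virtual environments.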

The job is now ready to be submitted to the SLURM scheduler. The following command in the terminal will do this:

sbatch python_batch.sh

When the job finishes, SLURM writes the program's output to a file (by default slurm-<jobid>.out in the directory you submitted from), which should contain:

The answer of 2*3*7 = 42
Also the square root of 1764 = 42

In this example we use Python's multiprocessing library to run a simple calculation in parallel across 10 CPUs. Each CPU processes one value from an array, demonstrating how to split work across multiple cores within a single node.

Requesting resources in a batch script is done as follows:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --job-name=python_cpu
#SBATCH --mem=8000
 
module purge
module load Python/3.13.5-GCCcore-14.3.0
source $HOME/venvs/my_env/bin/activate
 
python python_cpu.py

In the Python script we create a pool of 10 worker processes (one per CPU):

#!/usr/bin/env python
 
import multiprocessing 
import os # For reading the number of CPUs requested. 
import time # For clocking the calculation. 
 
def double(data):
    return data * 2
 
if __name__ == '__main__':
    begin = time.time()
    inputs = list(range(10)) # The list of inputs 0 to 9.
    poolSize = int(os.environ['SLURM_JOB_CPUS_PER_NODE']) # Number of CPUs requested.
    pool = multiprocessing.Pool(processes=poolSize)
    poolResults = pool.map(double, inputs) # Do the calculation.
    pool.close() # Stop accepting new tasks.
    pool.join()  # Wait for the workers to finish.
    print ('Pool output:', poolResults) # Results.
    elapsedTime = time.time() - begin
    print ('Time elapsed for ' , poolSize, ' workers: ', elapsedTime, ' seconds')

After the job has run, an output file is created containing the array with the doubled values, followed by the elapsed time. Note that it is possible to request fewer CPUs, in which case each worker computes more than one value of the array. However, requesting more than 10 CPUs makes no sense: there are only 10 values to compute, so the extra CPUs would sit idle.

GPUs are mainly used for parallel computation, since they have thousands of smaller cores that can perform many operations simultaneously, compared to the tens of powerful cores in a CPU. This makes them extremely efficient for workloads that can be broken down into many independent operations running at the same time.

Tasks that benefit greatly from GPU acceleration include training and running neural networks, large-scale numerical computations, and image and signal processing. On the other hand, code that is largely sequential, has a lot of conditional branching, or works with small datasets may actually run slower on a GPU than on a CPU, due to the overhead of moving data between CPU and GPU memory. In other words, make sure you use the right approach for your setup.

CUDA is NVIDIA's parallel computing platform that allows programs to run on the GPU. Most Python libraries with GPU support (like PyTorch and TensorFlow) are built on top of CUDA. In order to make use of it, you need to load a compatible CUDA module alongside your Python library, as the library needs it to communicate with the GPU hardware.

As an example, PyTorch is available as an optimised module on Hábrók with CUDA support built in. To see which versions are available:

module avail PyTorch

Load a CUDA-enabled version, for example:

module purge
module load Python/3.11.3-GCCcore-12.3.0
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1

Before submitting a job, it is worth verifying that PyTorch can see the GPU. You can do this in an interactive session on a GPU node:

python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

This should print True followed by the name of the GPU. If it prints False, the most likely reason is that you are on a login or interactive node, which do not have GPUs. To test your code on a GPU you need to either submit a job to the GPU partition or start an interactive session on a GPU node (see the Login nodes page for details on the different node types available). See also the Running jobs on GPUs page for details on requesting GPU resources, and note that GPU jobs require additional SBATCH parameters compared to CPU jobs, in particular --partition=gpu and --gres=gpu:1.

Now we show an example of a typical job script for a GPU job using PyTorch. The example below is based on the PyTorch Quickstart Tutorial and shows how to define and run a simple neural network on the GPU:

pytorch_gpu.py
import torch
from torch import nn
 
# Automatically use GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device")
 
# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )
 
    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
 
model = NeuralNetwork().to(device)        # Move the model to the selected device
print(model)
 
# Save the model
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch model state to model.pth")

In this Python script, we move our model to the selected device explicitly. Using torch.device("cuda" if torch.cuda.is_available() else "cpu") is good practice, as it allows the script to fall back to the CPU if no GPU is available. Note that this is the correct syntax for PyTorch 2.1.2, the version available as a module on Hábrók. Newer versions of PyTorch (2.4 and above) introduced torch.accelerator as a more general alternative that also supports accelerator types beyond CUDA. If you need this, you will have to install a newer version of PyTorch in a virtual environment rather than use the provided module (see Python Environments). More generally, different libraries have different syntax for checking GPU availability and moving data to the device, so always check your library's documentation for the correct approach.

The corresponding job script:

pytorch_job.sh
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --mem=16G
#SBATCH --job-name=pytorch_gpu
 
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
 
source $HOME/venvs/pytorch_env/bin/activate
 
python pytorch_gpu.py

This small job produces the following output, where we can see that the GPUs were successfully accessed and used:

Using cuda device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

###############################################################################
Hábrók Cluster
Job 28086503 for user p324428
Finished at: Tue Mar 24 09:50:49 CET 2026

Job details:
============

Job ID                         : 28086503
Name                           : pytorch_gpu
User                           : p324428
Partition                      : gpushort
Nodes                          : v100v2gpu18
Number of Nodes                : 1
Cores                          : 8
Number of Tasks                : 1
State                          : COMPLETED  
Submit                         : 2026-03-24T09:50:30
Start                          : 2026-03-24T09:50:31
End                            : 2026-03-24T09:50:45
Reserved walltime              : 01:00:00
Used walltime                  : 00:00:14
Used CPU time                  : 00:00:11 (Efficiency:  9.42%)
% User (Computation)           : 84.90%
% System (I/O)                 : 15.09%
Total memory reserved          : 16G
Maximum memory used            : 280.28M
Requested GPUs                 : 1
Allocated GPUs                 : v100=1
Max GPU utilization            : 0%
Max GPU memory used            : 0.00 

Acknowledgements:
=================

Please see this page for information about acknowledging Hábrók in your publications:

https://wiki.hpc.rug.nl/habrok/introduction/scientific_output

################################################################################

If you are using the GPU with, say, many small image files, you may notice that your jobs take a long time to complete, because the images are read by the GPU node sequentially. You can work around this by copying your data (as an archive) to the local storage of the GPU node. To do this, follow the instructions on the Many File Jobs page, which describes the process in more detail.

MPI (Message Passing Interface) is a standard for running code across multiple nodes simultaneously. Each node runs a separate process and they communicate by passing data between each other. This is useful when your computation is too large to fit on a single node, or when you want to parallelise work across many nodes at once. Common use cases include large-scale simulations, parameter sweeps, and distributed data processing.

mpi4py is the Python library that provides MPI support. It is available as a module on Hábrók and is the recommended way to use MPI in Python. Loading it is straightforward — it automatically brings in a compatible Python and OpenMPI version:

module purge
module load mpi4py/4.1.0-gompi-2025a

This loads Python/3.13.1-GCCcore-14.2.0 and OpenMPI/5.0.7-GCC-14.2.0 alongside mpi4py, so you do not need to load these separately.

The example below shows a basic MPI job where an array is scattered from the master node to all nodes, each node performs a computation on its piece of data, and the results are gathered back at the master node:

python_mpi_batch.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks=3
#SBATCH --job-name=python_mpi
#SBATCH --mem=8000
 
module purge
module load mpi4py/4.1.0-gompi-2025a
 
mpirun python ./python_mpi.py

python_mpi.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
 
# Scattering part.
if rank == 0:
    data = [(i+1)**2 for i in range(size)]
else:
    data = None
data = comm.scatter(data, root=0)
assert data == (rank+1)**2
 
# Check if data is scattered accordingly.
print("rank", rank, "has data:", data)
 
# Node dependent computations on data.
for i in range(size):
    if rank == i:
        data = data * rank
 
# Synchronization of the nodes.
comm.Barrier()
 
# Gathering part.
data = comm.gather(data, root=0)
if rank == 0:
    print(data)
else:
    assert data is None

Submit the job with:

sbatch python_mpi_batch.sh

The output of this script gives:

rank 0 has data: 1
[0, 4, 18]
rank 1 has data: 4
rank 2 has data: 9

###############################################################################
Hábrók Cluster
Job 28087298 for user p324428
Finished at: Tue Mar 24 10:18:38 CET 2026

Job details:
============

Job ID                         : 28087298
Name                           : python_mpi
User                           : p324428
Partition                      : regularshort
Nodes                          : omni[1,10]
Number of Nodes                : 2
Cores                          : 3
Number of Tasks                : 2
State                          : COMPLETED  
Submit                         : 2026-03-24T10:18:20
Start                          : 2026-03-24T10:18:21
End                            : 2026-03-24T10:18:34
Reserved walltime              : 00:05:00
Used walltime                  : 00:00:13
Used CPU time                  : 00:00:09 (Efficiency: 23.52%)
% User (Computation)           : 86.37%
% System (I/O)                 : 13.61%
Total memory reserved          : 16000M
Maximum memory used            : 160.61M

Acknowledgements:
=================

Please see this page for information about acknowledging Hábrók in your publications:

https://wiki.hpc.rug.nl/habrok/introduction/scientific_output

################################################################################

Note that the first three output lines can appear in a different order, since the ranks may complete at different times. The job information also shows that 3 CPUs were used over 2 nodes.

If you need to install additional packages alongside mpi4py, activate a virtual environment after loading the module; the Python version used to create the environment should match the one loaded by mpi4py (3.13.1).